
MiniCPM-V's Introduction

A GPT-4V Level Multimodal LLM on Your Phone

中文 | English

Join our 💬 WeChat

MiniCPM-Llama3-V 2.5 🤗 🤖 | MiniCPM-V 2.0 🤗 🤖 | Technical Blog

MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image and text as inputs and provide high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming to achieve strong performance and efficient deployment. The most notable models in this series currently include:

  • MiniCPM-Llama3-V 2.5: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 in overall performance. Equipped with enhanced OCR and instruction-following capabilities, the model also supports multimodal conversation in over 30 languages including English, Chinese, French, Spanish, German, etc. With the help of quantization, compilation optimizations, and several efficient inference techniques on CPUs and NPUs, MiniCPM-Llama3-V 2.5 can be efficiently deployed on end-side devices.

  • MiniCPM-V 2.0: The lightest model in the MiniCPM-V series. With 2B parameters, it surpasses larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It can accept image inputs of any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving performance comparable to Gemini Pro in scene-text understanding and matching GPT-4V in low hallucination rates.

News

📌 Pinned

  • [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported by llama.cpp and ollama! Please pull the latest code of our provided forks (llama.cpp, ollama). GGUF models in various sizes are available here. The MiniCPM-Llama3-V 2.5 series is not supported by the official repositories yet; we are working hard to merge the PRs. Please stay tuned!
  • [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics here.
  • [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. Click here to view more details.
  • [2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio’s official account, is available here. Come and try it out!

  • [2024.06.03] You can now run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. For more details, check this link.
  • [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it here!
  • [2024.05.24] We release the MiniCPM-Llama3-V 2.5 GGUF model, which supports llama.cpp inference and provides smooth decoding at 6~8 tokens/s on mobile phones. Try it now!
  • [2024.05.20] We open-source MiniCPM-Llama3-V 2.5. It has improved OCR capability and supports 30+ languages, representing the first end-side MLLM to achieve GPT-4V-level performance! We provide efficient inference and simple fine-tuning. Try it now!
  • [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click here to view more details.
  • [2024.04.18] We created a HuggingFace Space to host the demo of MiniCPM-V 2.0. Try it here!
  • [2024.04.17] MiniCPM-V-2.0 now supports deploying a WebUI demo!
  • [2024.04.15] MiniCPM-V-2.0 now also supports fine-tuning with the SWIFT framework!
  • [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Click here to view the MiniCPM-V 2.0 technical blog.
  • [2024.03.14] MiniCPM-V now supports fine-tuning with the SWIFT framework. Thanks to Jintao for the contribution!
  • [2024.03.01] MiniCPM-V now can be deployed on Mac!
  • [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.

Contents

MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

  • 🔥 Leading Performance. MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max and greatly outperforms other Llama 3-based MLLMs.

  • 💪 Strong OCR Capabilities. MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a 700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.

  • 🏆 Trustworthy Behavior. Leveraging the latest RLAIF-V method (the newest technique in the RLHF-V [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best performance within the open-source community. Data released.

  • 🌏 Multilingual Support. Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from VisCPM, MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to over 30 languages including German, French, Spanish, Italian, Korean etc. All Supported Languages.

  • 🚀 Efficient Deployment. MiniCPM-Llama3-V 2.5 systematically employs model quantization, CPU optimizations, NPU optimizations and compilation optimizations, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a 150x acceleration in end-side MLLM image encoding and a 3x speedup in language decoding.

  • 💫 Easy Usage. MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) GGUF format quantized models in 16 sizes, (3) efficient LoRA fine-tuning with only 2 V100 GPUs, (4) streaming output, (5) quick local WebUI demo setup with Gradio and Streamlit, and (6) interactive demos on HuggingFace Spaces.

Evaluation

Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.
| Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB test (en) | MMB test (cn) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | | | | | | | | | | | | | |
| Gemini Pro | - | 680 | 74.6 | 88.1 | 62.9 | 2148.9 | 73.6 | 74.3 | 48.9 | 45.8 | 79.9 | 60.4 | - |
| GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.5 | 1771.5 | 77.0 | 74.4 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
| Open-source | | | | | | | | | | | | | |
| Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | - | - | 31.7 | - | - | - | - |
| Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 51.6 | 1860.0 | 61.8 | 56.3 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 |
| DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 54.6 | 1765.4 | 73.8 | 71.4 | 38.3 | 36.8 | 77.8 | 54.2 | - |
| Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.2 | 2050.2 | 72.4 | 70.7 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 |
| CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 54.2 | 1736.6 | 65.8 | 55.9 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 |
| TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
| Idefics2 | 8.0B | - | 73.0 | 74.0 | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
| Bunny-LLama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
| LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | - | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
| Phi-3-vision-128k-instruct | 4.2B | 639* | 70.9 | - | - | 1537.5* | - | - | 40.4 | 44.5 | 64.2* | 58.8* | - |
| MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.5 | 1650.2 | 64.1 | 62.6 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 |
| MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 54.5 | 1808.6 | 69.1 | 66.5 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 |
| MiniCPM-Llama3-V 2.5 | 8.5B | 725 | 76.6 | 84.8 | 65.1 | 2024.6 | 77.2 | 74.2 | 45.8 | 54.3 | 86.7 | 63.5 | 89.7 |
* We evaluate the officially released checkpoint by ourselves.

Evaluation results of multilingual LLaVA Bench

Examples

We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is an unedited screen recording on a Xiaomi 14 Pro.

MiniCPM-V 2.0

Click to view more details of MiniCPM-V 2.0

MiniCPM-V 2.0 is an efficient model with promising performance for deployment. It is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features:

  • 🔥 State-of-the-art Performance.

    MiniCPM-V 2.0 achieves state-of-the-art performance on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc.) among models under 7B parameters. It even outperforms the strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Notably, MiniCPM-V 2.0 shows strong OCR capability, achieving performance comparable to Gemini Pro in scene-text understanding and state-of-the-art performance on OCRBench among open-source models.

  • 🏆 Trustworthy Behavior.

    LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images. MiniCPM-V 2.0 is the first end-side LMM aligned via multimodal RLHF for trustworthy behavior (using the recent RLHF-V [CVPR'24] series technique). This allows the model to match GPT-4V in preventing hallucinations on Object HalBench.

  • 🌟 High-Resolution Images at Any Aspect Ratio.

    MiniCPM-V 2.0 can accept images of up to 1.8 million pixels (e.g., 1344x1344) at any aspect ratio. This enables better perception of fine-grained visual information such as small objects and optical characters, achieved via a recent technique from LLaVA-UHD.

  • ⚡️ High Efficiency.

    MiniCPM-V 2.0 can be efficiently deployed on most GPU cards and personal computers, and even on end devices such as mobile phones. For visual encoding, we compress the image representations into far fewer tokens via a perceiver resampler (a minimal illustrative sketch of this idea follows this feature list). This allows MiniCPM-V 2.0 to operate with favorable memory cost and speed during inference, even when dealing with high-resolution images.

  • 🙌 Bilingual Support.

    MiniCPM-V 2.0 supports strong bilingual multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from VisCPM [ICLR'24].
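The High Efficiency item above mentions compressing image representations into far fewer tokens via a perceiver resampler. The following is a minimal, illustrative PyTorch sketch of that idea only: a fixed set of learned queries cross-attends to the vision encoder's patch features, so the LLM always receives a small, constant number of visual tokens. Module names, dimensions, and the query count are placeholders, not the repository's actual implementation.

# Illustrative sketch of a perceiver-style resampler (placeholder sizes).
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, num_queries=96, dim=1152, num_heads=8):
        super().__init__()
        # A fixed set of learned query tokens; their count bounds the number
        # of visual tokens handed to the language model.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_q = nn.LayerNorm(dim)
        self.ln_kv = nn.LayerNorm(dim)

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, dim) from the vision encoder;
        # num_patches can be large for high-resolution inputs.
        b = patch_features.size(0)
        q = self.ln_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.ln_kv(patch_features)
        # Cross-attention: the queries attend to all patches and produce a
        # fixed-length sequence regardless of input resolution.
        out, _ = self.attn(q, kv, kv)
        return out  # (batch, num_queries, dim)

features = torch.randn(1, 1600, 1152)        # e.g., 1,600 patch features
print(PerceiverResampler()(features).shape)  # -> torch.Size([1, 96, 1152])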

Examples

We deploy MiniCPM-V 2.0 on end devices. The demo video is an unedited screen recording on a Xiaomi 14 Pro.

Legacy Models

| Model | Introduction and Guidance |
|---|---|
| MiniCPM-V 1.0 | Document |
| OmniLMM-12B | Document |

Chat with Our Demo on Gradio

We provide online and local demos powered by HuggingFace Gradio, the most popular model deployment framework nowadays. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.

Online Demo

Click here to try out the online demos of MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0 on HuggingFace Spaces.

Local WebUI Demo

You can easily build your own local WebUI demo with Gradio using the following commands.

pip install -r requirements.txt
# For NVIDIA GPUs, run:
python web_demo_2.5.py --device cuda

# For Mac with MPS (Apple silicon or AMD GPUs), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
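
The provided web_demo_2.5.py is the full-featured demo. If you only need a bare-bones custom UI, the sketch below shows one way to wire Gradio to the model's chat interface (the same interface used in the Mac example later in this README); the tuple handling and sampling settings are assumptions, not the demo script's actual logic.

# Minimal Gradio wiring (sketch; not the provided web_demo_2.5.py).
import gradio as gr
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                  trust_remote_code=True,
                                  torch_dtype=torch.float16).to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                          trust_remote_code=True)

def answer(image: Image.Image, question: str) -> str:
    msgs = [{'role': 'user', 'content': question}]
    res = model.chat(image=image, msgs=msgs, context=None,
                     tokenizer=tokenizer, sampling=True)
    # Some versions of the chat interface return a tuple (answer, context, ...).
    return res[0] if isinstance(res, tuple) else res

demo = gr.Interface(fn=answer,
                    inputs=[gr.Image(type='pil'), gr.Textbox(label='Question')],
                    outputs=gr.Textbox(label='Answer'))
demo.launch()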

Install

  1. Clone this repository and navigate to the source folder
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
  2. Create a conda environment
conda create -n MiniCPM-V python=3.10 -y
conda activate MiniCPM-V
  3. Install dependencies
pip install -r requirements.txt

Inference

Model Zoo

| Model | Device | Memory | Description | Download |
|---|---|---|---|---|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | The latest version, achieving state-of-the-art end-side multimodal performance. | 🤗 |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 5 GB | The gguf version, with lower memory usage and faster inference. | 🤗 |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, with lower GPU memory usage. | 🤗 |
| MiniCPM-V 2.0 | GPU | 8 GB | The light version, balancing performance and computation cost. | 🤗 |
| MiniCPM-V 1.0 | GPU | 7 GB | The lightest version, achieving the fastest inference. | 🤗 |
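
For reference, the quantized checkpoints in the table load through the same Hugging Face path as the full model. The snippet below is a sketch for the int4 variant; the repository id openbmb/MiniCPM-Llama3-V-2_5-int4 and the bitsandbytes-based loading behavior are assumptions.

# Sketch: load the int4-quantized checkpoint (repository id assumed).
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5-int4'   # assumed HF repo id
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()
# If the checkpoint ships bitsandbytes 4-bit weights, device placement is
# handled at load time, so no explicit .to('cuda') call should be needed.
# Expect roughly the 8 GB of GPU memory listed in the Model Zoo table.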

Multi-turn Conversation

Please refer to the following code to run a multi-turn conversation.

from chat import MiniCPMVChat, img2base64
import torch
import json

torch.manual_seed(0)

chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5')

im_64 = img2base64('./assets/airplane.jpeg')

# First round chat 
msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# Second round chat 
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Introduce something about Airbus A380."})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

You will get the following output:

"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."

"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."

Inference on Multiple GPUs

You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. Please refer to this tutorial for detailed instructions on how to load the model and run inference using multiple low-VRAM GPUs.
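
The linked tutorial has the authoritative steps; as a rough sketch of the idea, Hugging Face Accelerate can shard the layers across the visible GPUs through a device map. The device_map='auto' choice and the per-GPU memory caps below are illustrative assumptions, not the tutorial's exact configuration.

# Sketch: spread layers over two low-VRAM GPUs (settings are assumptions).
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-Llama3-V-2_5',
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map='auto',                    # let Accelerate place the layers
    max_memory={0: '11GiB', 1: '11GiB'},  # leave headroom on 12 GB cards
)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                          trust_remote_code=True)
model.eval()
# Inference then proceeds exactly as in the single-GPU examples above.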

Inference on Mac

Click to view an example of running MiniCPM-Llama3-V 2.5 on a 💻 Mac with MPS (Apple silicon or AMD GPUs).
# test.py  Need more than 16GB memory.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)

Run with command:

PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py

Deployment on Mobile Phone

MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click MiniCPM-Llama3-V 2.5 / MiniCPM-V 2.0 to install apk.

Inference with llama.cpp

MiniCPM-Llama3-V 2.5 can now run with llama.cpp! See our fork of llama.cpp for more details. This implementation supports smooth inference at 6~8 tokens/s on mobile phones (test environment: Xiaomi 14 Pro + Snapdragon 8 Gen 3).

Inference with vLLM

Click to see how to run inference with MiniCPM-V 2.0 using vLLM (MiniCPM-Llama3-V 2.5 support coming soon). Because our pull request to vLLM is still under review, we maintain a fork of the repository to build and test our vLLM demo. Here are the steps:
  1. Clone our version of vLLM:
git clone https://github.com/OpenBMB/vllm.git
  2. Install vLLM:
cd vllm
pip install -e .
  3. Install timm:
pip install timm==0.9.10
  4. Run our demo:
python examples/minicpmv_example.py

Fine-tuning

Simple Fine-tuning

We support simple fine-tuning with Hugging Face for MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5.

Reference Document
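
The reference document above describes the repository's own fine-tuning scripts. For orientation only, the snippet below sketches the general LoRA recipe with the PEFT library; the rank, target module names, and other hyperparameters are placeholders rather than the official configuration.

# Generic LoRA sketch with PEFT (placeholder hyperparameters).
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                  trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],  # placeholder attention projections
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the adapter weights train
# Training then proceeds with a standard Trainer / custom loop over the
# image-text supervision described in the reference document.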

With the SWIFT Framework

We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete adapter library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.

Best Practices: MiniCPM-V 1.0, MiniCPM-V 2.0

TODO

  • MiniCPM-V fine-tuning support
  • Code release for real-time interactive assistant

Model License

  • This repository is released under the Apache-2.0 License.

  • The usage of MiniCPM-V model weights must strictly follow MiniCPM Model License.md.

  • The models and weights of MiniCPM are completely free for academic research. After filling out a questionnaire for registration, they are also available for free commercial use.

Statement

As LMMs, MiniCPM-V models (including OmniLMM) generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-V models does not represent the views and positions of the model developers.

We will not be liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, dissemination, or misapplication of the models.

Institutions

This project is developed by the following institutions:

Other Multimodal Projects from Our Team

👏 Welcome to explore other multimodal projects of our team:

VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V

🌟 Star History

Citation

If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!

@article{yu2023rlhf,
  title={RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback},
  author={Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and others},
  journal={arXiv preprint arXiv:2312.00849},
  year={2023}
}
@article{viscpm,
  title={Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages},
  author={Jinyi Hu and Yuan Yao and Chongyi Wang and Shan Wang and Yinxu Pan and Qianyu Chen and Tianyu Yu and Hanghao Wu and Yue Zhao and Haoye Zhang and Xu Han and Yankai Lin and Jiao Xue and Dahai Li and Zhiyuan Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2308.12038},
  year={2023}
}
@article{xu2024llava-uhd,
  title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
  author={Xu, Ruyi and Yao, Yuan and Guo, Zonghao and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
  journal={arXiv preprint arXiv:2403.11703},
  year={2024}
}
@article{yu2024rlaifv,
  title={RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness}, 
  author={Yu, Tianyu and Zhang, Haoye and Yao, Yuan and Dang, Yunkai and Chen, Da and Lu, Xiaoman and Cui, Ganqu and He, Taiwen and Liu, Zhiyuan and Chua, Tat-Seng and Sun, Maosong},
  journal={arXiv preprint arXiv:2405.17220},
  year={2024}
}


MiniCPM-V's Issues

Memory leak when looping inference

Hi, I got an out-of-memory error / memory leak after some calls. Here is the script:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# `prompt` is assumed to be defined earlier in the user's script.

model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, cache_dir="./cache", torch_dtype=torch.bfloat16)
model = model.to(device='cuda', dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, cache_dir="./cache")

files = ['a.jpg', 'b.jpg', 'c.jpg', ...]
for image_path in files:
        msgs = [{'role': 'user', 'content': prompt}]
        image = Image.open(image_path).convert('RGB')
        res, context, _ = model.chat(
            image=image,
            msgs=msgs,
            context=None,
            tokenizer=tokenizer,
            sampling=False,
            temperature=0.0
        )
        print(res)

fine tune

Hello, is it possible (or will it be possible) to fine-tune with PEFT LoRA and quantization?

Request for a 4bit quantization model for omnilmm 12B

I have tried multi-card inference but it failed; it seems the layers are not dispatched correctly.

        with init_empty_weights():
            model = OmniLMMForCausalLM.from_pretrained(model_name, tune_clip=True, torch_dtype=torch.bfloat16)
        model = load_checkpoint_and_dispatch(model, model_name, dtype=torch.bfloat16, 
                    device_map="balanced",  no_split_module_classes=['Eva','MistralDecoderLayer', 'ModuleList', 'Resampler']
        )

So it would be convenient to provide a quantized model for low-memory cards; it would be nice if it fits under 20 GB on a single card or 40 GB across multiple cards.

Inconsistent accuracy of the 3B model on MME

I downloaded the 3B model and ran the official MME test set using the official default prompt. The score I computed is 1378.67, which differs considerably from the reported 1452. Could the difference be caused by using a different prompt during testing?

finetune code

I noticed that there is no finetune code. Can the VisCPM code be used, or will the team release it later?

Which base model is used?

I noticed that the MiniCPM LLM has multiple versions, e.g., fp32, bf16, dpo, and sft.
Which LLM does the multimodal model use?

Hallucination in mathematical reasoning?

The mathematical reasoning is wrong.
prompt: 1+1/sqrt(2)=?
completion: 1 + 1/sqrt(2) can be turned into a mathematical expression.

First, we can simplify the 2 under the radical to √2.

Then, we can treat 1 as a constant and add it to 1/√2. To turn 1/√2 into a fraction, we can view it as a fraction containing a radical. We can square the 2 under the radical to get 2. Therefore, 1/√2 can be written as 1/(√2 * √2), i.e., 1/(2 * 2).

Now we can write 1 + 1/√2 as:

1 + 1/√2 = 1 + 1/2

Next, we can convert this expression into a mixed number:

1 + 1/√2 = 2/2 + 1/2 = 3/2

So the result of 1 + 1/√2 is 3/2, or 1 1/2.

Training Data Composition for MiniCPM-v2

Thanks for the great work!

I am wondering where we can find the training data composition for MiniCPM-v2. Specifically, is any distillation from other LMMs used?

Is Mac supported?

When deploying on a Mac, I get an error indicating that CUDA is not being used:

(OmniLMM) crz@crzdeMacBook-Air OmniLMM % pip install flash_attn

Collecting flash_attn
Downloading flash_attn-2.5.2.tar.gz (2.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.5/2.5 MB 882.4 kB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
fatal: not a git repository (or any of the parent directories): .git
/private/var/folders/fn/bz39wtv175s83wx3cm11lptm0000gn/T/pip-install-k91r8ka8/flash-attn_fccd9cf359024ff8acc3369bdc82e45f/setup.py:78: UserWarning: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
warnings.warn(
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/private/var/folders/fn/bz39wtv175s83wx3cm11lptm0000gn/T/pip-install-k91r8ka8/flash-attn_fccd9cf359024ff8acc3369bdc82e45f/setup.py", line 133, in
CUDAExtension(
File "/Users/crz/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
library_dirs += library_paths(cuda=True)
File "/Users/crz/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
if (not os.path.exists(_join_cuda_home(lib_dir)) and
File "/Users/crz/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

  torch.__version__  = 2.0.1
  
  
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

MiniCPM-V Quantization

Hi, is it possible to convert OmniLMM-3B (MiniCPM-V) to a quantized version (e.g., GPTQ)?

Question about details on the training procedure and data

Hi, and thanks for this fantastic work!

I am using your models for a scientific project, which requires details about the training procedure and training data. I haven't found a paper or a technical report specifically for this work. Did I miss something?

Are the training procedure and data the same as in https://arxiv.org/pdf/2308.12038.pdf?

Can you please share some details or a technical report?

Thanks :-)

Unable to get all of the text back in OCR scenarios

First of all, thank you very much for your team's excellent contribution. I intend to use the model for OCR: reading the content of an image and returning the complete text.

However, after trying multiple prompts I still cannot get the complete text output. Could you advise on a solution or the correct way to use the model?

The test OCR image is the first page of <MapReduce: Simplified Data Processing on Large Clusters>.


Benchmark results of TextVQA

We got different benchmark results when evaluating the MiniCPM-V 2.0 model provided on Hugging Face. For example, we got only 66 on TextVQA, which differs from the 74.1 reported in the technical report. We directly use the chat API in modeling_minicpmv.py like this:

        output_ids, context, _ = model.chat(
            image=image_tensor,
            msgs=msgs,
            context=None,
            tokenizer=tokenizer,
            sampling=True,
            temperature=0.7
        )

Would you please provide the detailed hyperparameters for the evaluation?

ModuleNotFoundError: No module named 'transformers_modules.MiniCPM-V-2'

(OmniLMM) [zhuguibo@GPU1 OmniLMM]$ python example.py
Traceback (most recent call last):
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/example.py", line 52, in
option1()
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/example.py", line 5, in option1
chat_model = OmniLMMChat('/mnt/data/dataProcess/LLM_Models/MiniCPM-V-2.0')
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/chat.py", line 169, in init
self.model = OmniLMM3B(model_path)
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/chat.py", line 141, in init
self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(dtype=torch.bfloat16)
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 526, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1091, in from_pretrained
config_class = get_class_from_dynamic_module(
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 500, in get_class_from_dynamic_module
return get_class_in_module(class_name, final_module.replace(".py", ""))
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 200, in get_class_in_module
module = importlib.import_module(module_path)
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 992, in _find_and_load_unlocked
File "", line 241, in _call_with_frames_removed
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 992, in _find_and_load_unlocked
File "", line 241, in _call_with_frames_removed
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers_modules.MiniCPM-V-2'

Error when using MiniCPM-V on a Redmi K70 (16 GB)

MLCChat failed

Stack trace:
org.apache.tvm.Base$TVMError: TVMError: OpenCL build error for device=0x78b138eaa8
Error: CL_OUT_OF_HOST_MEMORY
Pass
Stack trace:
File "/Users/acha/Desktop/mlc-MiniCPM/3rdparty/tvm/src/runtime/library_module.cc", line 78

at org.apache.tvm.Base.checkCall(Base.java:173)
at org.apache.tvm.Function.invoke(Function.java:130)
at com.modelbest.mlcllm.ChatModule.image(ChatModule.java:64)
at com.modelbest.minicpm.AppViewModel$ChatState$requestImage$f$1$1.invoke(AppViewModel.kt:637)
at com.modelbest.minicpm.AppViewModel$ChatState$requestImage$f$1$1.invoke(AppViewModel.kt:637)
at com.modelbest.minicpm.AppViewModel$ChatState.callBackend(AppViewModel.kt:516)
at com.modelbest.minicpm.AppViewModel$ChatState.requestImage$lambda$4(AppViewModel.kt:637)
at com.modelbest.minicpm.AppViewModel$ChatState.$r8$lambda$-81FfOHgL43Kw6mcY6rLiYQKEuE(Unknown Source:0)
at com.modelbest.minicpm.AppViewModel$ChatState$$ExternalSyntheticLambda2.run(Unknown Source:12)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:487)
at java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
at java.lang.Thread.run(Thread.java:1012)

Error message:
TVMError: OpenCL build error for device=0x78b138eaa8
Error: CL_OUT_OF_HOST_MEMORY
Pass
Stack trace:
File "/Users/acha/Desktop/mlc-MiniCPM/3rdparty/tvm/src/runtime/library_module.cc", line 78

Object detection capability

Does the model support detection, i.e., returning bounding boxes? Many multimodal models already support this with strong performance, but in my tests MiniCPM-V does not seem to support it.
MiniCPM-V's perception of spatial relationships is quite strong, though, so why was this capability not trained?
If detection pretraining were added, and fine-tuning included bbox inputs paired with text, would that make it easier for the model to acquire new knowledge?

Installed CUDA, but there is a CUDA version mismatch; the code couldn't run without CUDA

I installed conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c
pytorch -c nvidia
but an error message showed up: RuntimeError: cutlassF: no kernel found to launch!. It indicates that my CUDA version is not correct. I downloaded the model from ModelScope.

I installed the dependencies according to requirements.txt, but when running the code the system prompted me to install CUDA. After installing CUDA it still reports an error saying my CUDA version does not match yours. I downloaded the model myself and was running it when this happened.

Evaluation method for Object HalBench

How is the Object HalBench score on the leaderboard computed? Is the computation different from the one used in RLHF-V?

Error about missing methods

Hello. I'm getting the following error when I try to run the demo of MiniCPM-V on Ubuntu 22.04:
(OmniLMM) markusc@ubuntu2:~/github/OmniLMM$ /home/markusc/anaconda3/envs/OmniLMM/bin/python /home/markusc/github/OmniLMM/infer.py
Loading checkpoint shards: 100%|████████████████████████████████████████████| 2/2 [00:01<00:00, 1.94it/s]
Traceback (most recent call last):
  File "/home/markusc/github/OmniLMM/infer.py", line 10, in <module>
    answer = chat_model.process(inputs)
AttributeError: 'OmniLMMChat' object has no attribute 'process'

Apart from importing the json module and changing the model path to openbmb/MiniCPM-V, I did not make any modifications to the original code, as follows:

from chat import OmniLMMChat, img2base64
import json
chat_model = OmniLMMChat('openbmb/MiniCPM-V')

im_64 = img2base64('./assets/worldmap_ck.jpg')

msgs = [{"role": "user", "content": "What is interesting about this image?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Where is China in the image"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

Thanks in advance.

Chat without image input

Apologies for asking this in the issues tab.

Is this model capable of chatting without image input? I see in the benchmarks you have some models that can take both image+text and just text. I'm looking to use this model for a project, but I'm unsure if it'll work for my needs.

Training data and training method?

  1. Could the training data be open-sourced so the results can be reproduced and benchmark gaming ruled out?
  2. Could you briefly describe the training method, e.g., how the projector is trained and whether the ViT is frozen?
