AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

How do Multimodal LLMs perform on Image Aesthetics Perception?

^1*Yipo Huang, ^1*Quan Yuan,¹Xiangfei Sheng, ¹Zhichao Yang,²Haoning Wu

¹Pengfei Chen, ³Yuzhe Yang,^1#Leida Li, ²Weisi Lin

¹Xidian University, ²Nanyang Technological University, ³OPPO Research Institute

^*Equal contribution. ^#Corresponding author.

If you like this research, please give us a star ⭐ on GitHub.

We construct a high-quality Expert-labeled Aesthetic Perception Database (EAPD), based on which we further build the golden benchmark to evaluate four abilities of MLLMs on image aesthetics perception, including Aesthetic Perception (AesP), Aesthetic Empathy (AesE), Aesthetic Assessment (AesA) and Aesthetic Interpretation (AesI).

News

[2024/01/20] 🎉 Congrats to SPHINX-MoE for achieving new SOTAs on AesP and AesE!!
[2024/01/18] 🤗 Database of AesBench now support Huggingface!
[2024/01/17] 🚩 We have released the Evaluation Database and Codes of AesBench! Check Here for more details.

GPT-4V and Gemini Pro Vision!

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesP.

Rank	MLLM	Tec. Qua.	Col. Lig.	Comp.	Content	NIs	AIs	AGIs	Yes-No	What	How	Why	Overall
🥇	SPHINX-MoE	66.67%	76.31%	72.68%	66.31%	75.84%	72.19%	68.88%	69.12%	62.18%	80.38%	88.05%	72.93%
🥈	Q-Instruct	66.03%	74.48%	73.68%	68.09%	76.48%	69.70%	69.28%	64.68%	63.31%	85.28%	86.34%	72.61%
🥉	GPT-4V	69.02%	74.66%	71.72%	65.57%	75.67%	72.58%	65.82%	68.93%	64.67%	76.70%	84.46%	72.08%
4	Gemini Pro Vision	65.08%	74.57%	72.24%	67.97%	74.63%	69.62%	70.03%	64.70%	64.95%	78.71%	90.24%	71.99%
5	ShareGPT4V	62.18%	71.90%	69.29%	64.89%	70.79%	71.57%	63.96%	69.32%	61.33%	72.01%	77.56%	69.18%
6	mPLUG-Owl2	60.90%	70.57%	68.30%	62.77%	72.23%	64.71%	64.10%	65.59%	58.64%	73.02%	80.73%	67.89%
7	LLaVA-1.5	53.85%	70.16%	67.40%	59.93%	69.10%	65.71%	62.37%	62.36%	58.92%	70.71%	81.22%	66.32%
8	Qwen-VL	54.81%	66.25%	62.91%	60.64%	68.30%	58.85%	59.44%	61.25%	55.38%	67.53%	74.15%	63.21%
9	LLaVA	46.79%	63.59%	65.30%	64.54%	64.29%	61.10%	60.77%	65.39%	52.27%	61.18%	74.88%	62.43%
10	InstructBLIP	37.82%	55.36%	55.43%	57.09%	57.06%	55.86%	47.21%	59.84%	45.01%	54.98%	56.34%	54.29%
11	MiniGPT-v2	56.73%	56.44%	51.74%	50.00%	56.74%	53.24%	50.93%	53.99%	43.06%	58.73%	66.10%	54.18%
12	GLM	55.77%	54.61%	51.25%	48.94%	54.90%	55.24%	47.34%	60.95%	44.62%	48.48%	55.61%	52.96%
13	Otter	35.90%	54.28%	51.65%	51.06%	51.04%	50.62%	51.20%	56.10%	44.48%	51.37%	49.02%	50.96%
14	IDEFICS-Instruct	37.50%	52.87%	52.84%	51.06%	52.97%	50.12%	48.40%	50.96%	44.62%	51.09%	60.73%	50.82%
15	MiniGPT-4	39.42%	41.31%	42.67%	44.33%	41.57%	42.89%	41.36%	47.23%	32.01%	41.99%	46.10%	41.93%
16	TinyGPT-V	21.79%	24.52%	22.13%	28.01%	22.71%	24.69%	24.34%	32.39%	17.99%	19.77%	19.27%	23.71%

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesE.

Rank	MLLM	Emotion	Interest	Uniqueness	Vibe	NIs	AIs	AGIs	Yes-No	What	How	Why	Overall
🥇	SPHINX-MoE	68.59%	80.65%	75.86%	82.14%	74.72%	75.19%	69.02%	74.95%	62.89%	72.71%	88.48%	73.32%
🥈	Q-Instruct	68.64%	83.86%	75.86%	80.00%	76.65%	72.19%	66.62%	64.30%	67.42%	81.57%	86.76%	72.68%
🥉	Gemini Pro Vision	66.87%	87.50%	70.00%	79.09%	70.60%	72.35%	71.53%	67.50%	64.52%	72.25%	90.37%	71.37%
4	ShareGPT4V	66.48%	80.65%	68.97%	78.72%	70.95%	73.69%	67.29%	67.75%	65.58%	72.71%	83.58%	70.75%
5	GPT-4V	65.06%	72.41%	62.07%	80.15%	73.87%	72.08%	62.27%	68.67%	64.02%	70.07%	84.20%	70.16%
6	mPLUG-Owl2	65.60%	77.42%	65.52%	78.07%	71.03%	71.57%	66.22%	68.05%	64.16%	70.14%	83.82%	69.89%
7	LLaVA-1.5	62.49%	80.65%	75.85%	78.93%	69.26%	69.58%	65.43%	62.37%	64.16%	71.71%	84.07%	68.32%
8	LLaVA	58.61%	80.63%	65.52%	75.83%	67.01%	66.96%	58.38%	67.95%	55.95%	60.14%	79.66%	64.68%
9	Qwen-VL	58.67%	83.87%	72.41%	73.90%	63.88%	67.08%	61.57%	60.65%	58.07%	66.14%	79.90%	64.18%
10	MiniGPT-v2	52.52%	58.06%	44.83%	58.07%	55.86%	55.85%	50.27%	57.81%	43.48%	53.43%	66.42%	54.36%
11	GLM	53.13%	70.97%	44.83%	55.29%	56.58%	54.86%	48.67%	60.65%	41.78%	50.43%	64.95%	53.96%
12	InstructBLIP	49.64%	58.06%	51.72%	61.50%	55.06%	55.24%	48.94%	55.88%	50.99%	51.43%	58.33%	53.89%
13	Otter	48.42%	70.97%	51.72%	63.21%	53.05%	55.74%	52.39%	54.77%	51.84%	53.43%	54.41%	53.64%
14	IDEFICS-Instruct	43.93%	64.52%	62.07%	64.06%	50.72%	53.12%	49.07%	50.20%	41.08%	52.43%	66.42%	50.82%
15	MiniGPT-4	39.78%	38.71%	24.14%	39.04%	42.70%	37.78%	35.51%	50.61%	31.59%	31.86%	38.48%	39.35%
16	TinyGPT-V	30.36%	29.03%	31.03%	35.40%	32.50%	36.03%	26.99%	36.00%	29.89%	28.86%	31.62%	32.04%

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesA.

Rank	MLLM	NIs	AIs	AGIs	Overall
🥇	Q-Instruct	62.20%	49.75%	40.69%	52.86%
🥈	GPT-4V	59.98%	46.92%	40.59%	50.86%
🥉	mPLUG-Owl2	57.78%	49.50%	40.83%	50.57%
4	SPHINX-MoE	57.62%	48.50%	38.70%	49.93%
5	Gemini Pro Vision	54.17%	48.39%	42.20%	49.38%
6	ShareGPT4V	54.65%	48.38%	35.90%	47.82%
7	InstructBLIP	52.73%	47.88%	34.84%	46.54%
8	Qwen-VL	54.25%	39.28%	40.43%	46.25%
9	LLaVA	51.69%	48.00%	34.31%	45.96%
10	LLaVA-1.5	50.08%	48.13%	34.97%	45.46%
11	IDEFICS-Instruct	50.00%	47.76%	33.78%	45.00%
12	Otter	49.20%	48.25%	34.04%	44.86%
13	TinyGPT-V	44.06%	41.65%	44.81%	43.57%
14	MiniGPT-4	41.65%	36.28%	35.90%	38.57%
15	GLM	38.92%	37.78%	35.90%	37.79%
16	MiniGPT-v2	27.05%	31.92%	36.97%	31.11%

Here is the comparison of GPT-4V, Gemini Pro Vision, and other OA MLLMs on AesI.

Rank	Model	Relevance	Precision	Completeness	Overall
🥇	GPT-4V	1.385	1.151	1.366	1.301
🥈	ShareGPT4V	1.440	1.117	1.331	1.296
🥉	SPHINX-MoE	1.501	1.171	1.130	1.267
4	Gemini Pro Vision	1.416	1.087	1.164	1.222
5	Qwen-VL	1.393	1.006	1.175	1.192
6	mPLUG-Owl2	1.402	1.016	1.130	1.182
7	IDEFICS-Instruct	1.406	1.007	1.126	1.180
8	LLaVA-1.5	1.397	0.953	1.120	1.157
9	InstructBLIP	1.372	0.863	1.144	1.126
10	LLaVA	1.374	0.918	1.084	1.125
11	Otter	1.242	0.848	0.989	1.027
12	Q-Instruct	1.222	0.939	0.898	1.020
13	MiniGPT-v2	1.191	0.868	0.948	1.003
14	MiniGPT-4	1.158	0.823	1.016	0.999
15	GLM	1.122	0.729	0.944	0.932
16	TinyGPT-V	0.871	0.511	0.720	0.701

Submission Guideline

via GitHub Release: Please see our release for details.

Acknowledgement

Special thanks are extended to the 32 aesthetic experts who participated in our experiments, whose rich aesthetic experience and responsible attitude played a crucial role in the construction of the dataset. We highlight the following:

Wei Liu, Xin Liu, Luxia Chen, Tianjiao Gu, Dahai Tian, Ziyan Ou, et al.

Many thanks are extended to collaborators, for their kind assistance in data collection and MLLM deployment:

Zhichao Duan and Pangu Xie.

Citation

If you find our work interesting, please feel free to cite our paper:

@article{AesBench,
    title={AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception},
    author={Huang, Yipo and Yuan, Quan and Sheng, Xiangfei and Yang, Zhichao and Wu, Haoning and Chen, Pengfei and Yang, Yuzhe and Li, Leida and Lin, Weisi},
   journal={arXiv preprint arXiv:2401.08276},
    year={2024},
}

lihuibng / aesbench Goto Github PK

aesbench's Introduction

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

If you like this research, please give us a star ⭐ on GitHub.

News

GPT-4V and Gemini Pro Vision!

Submission Guideline

Acknowledgement

Citation

aesbench's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent