CL MM | Sep 22, 2025

RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios

Fei Zhao, Chengqiang Lu, Yufan Shen, Qimeng Wang, Yicheng Qian, Haoxin Zhang, Yan Gao, Yi Wu, Yao Hu, Zhen Wu, Shangyu Xing, Xinyu Dai

arXiv:2509.17421v14.91 citations | h-index: 12 | Has Code | EMNLP

Originality: Synthesis-oriented

AI Analysis

RealBench provides a benchmark for evaluating multi-image understanding in Chinese, addressing a gap for researchers and developers in multilingual AI. The contribution is incremental, however, in that it brings an evaluation setting already covered by English datasets to a new language.

The authors address the lack of a Chinese multimodal multi-image dataset by introducing RealBench, which contains 9393 samples and 69910 images. They find that even top closed-source models struggle with these scenarios, and that open-source visual/video models trail closed-source models by around 71.8% on average.

Although various multimodal multi-image evaluation datasets have emerged, they are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Finally, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context.
