CV, AI | Aug 8, 2025

Can Large Models Fool the Eye? A New Turing Test for Biological Animation

Zijian Chen, Lirong Deng, Zhengyu Chen, Kaiwei Zhang, Qi Jia, Yuan Tian, Yucheng Zhu, Guangtao Zhai

arXiv:2508.06072v113.16 citations | h-index: 27 | Has Code

Originality: Incremental advance

AI Analysis

This work gives researchers and developers a more intuitive and discriminative way to evaluate large models, using a benchmark grounded in visual perception. Its contribution to evaluation methodology is incremental.

The paper introduces BioMotion Arena, a framework that evaluates large language models and multimodal large language models by using visual animations of biological motion to highlight performance gaps. Results show that over 90% of tested models, including advanced ones like InternVL3 and Claude-4, fail to generate basic humanoid point-light animations, providing a challenging benchmark for model assessment.

Evaluating the abilities of large models and exposing the gaps between them is challenging. Current benchmarks rely either on ground-truth-based scoring over static datasets or on chatbot-style collection of human preferences over text, where differences are hard to discern. Neither approach may provide users with immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the inherent visual perception of the motion patterns characteristic of living organisms, and uses point-light source imaging to amplify the performance discrepancies between models. Specifically, we employ a pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants. Data analyses show that the crowd-sourced human votes are in good agreement with those of expert raters, demonstrating the superiority of BioMotion Arena in offering discriminative feedback. We also find that over 90% of evaluated models, including the cutting-edge open-source InternVL3 and the proprietary Claude-4 series, fail to produce fundamental humanoid point-light groups, much less smooth and biologically plausible motions. This enables BioMotion Arena to serve both as a challenging benchmark for performance visualization and as a flexible evaluation framework that does not depend on ground truth.
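
The pairwise comparison setup described in the abstract can be illustrated with a small sketch. The abstract does not say how votes are turned into a ranking, so the Elo update used here is an assumption, chosen because it is a common way to aggregate pairwise preferences. The function name `elo_ratings`, the vote format, the model names, and the parameters `k` and `base` are all hypothetical.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Aggregate pairwise votes into Elo ratings.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". This vote format is an assumption made for
    illustration, not the format used by BioMotion Arena.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for a, b, winner in votes:
        ra, rb = ratings[a], ratings[b]
        # Expected score of model a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome[winner]
        # Zero-sum update: whatever a gains, b loses.
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return dict(ratings)


# Hypothetical votes on three placeholder models.
votes = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "a"),
]
ratings = elo_ratings(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order in which votes arrive. With tens of thousands of votes, an order-independent fit such as a Bradley-Terry model may be preferable, but that choice is likewise not specified in the abstract.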
