MA-Bench: Towards Fine-grained Micro-Action Understanding
This work addresses a gap in evaluating MLLMs on fine-grained human micro-action analysis, a capability that matters for emotion analysis. Its contribution is incremental: it provides a benchmark and training data rather than a novel model.
The authors address the lack of benchmarks for micro-action understanding in Multimodal Large Language Models (MLLMs) by introducing MA-Bench, a benchmark of 1,000 videos and 12,000 question-answer pairs. Evaluating 23 MLLMs, they find that the models struggle to capture fine-grained motion. Fine-tuning Qwen3-VL-8B on MA-Bench-Train, a 20.5K-video training corpus, improves performance.
Despite the rapid development of Multimodal Large Language Models (MLLMs), their potential for micro-action understanding, which plays a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. Results for 23 representative MLLMs reveal significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus of 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. Qwen3-VL-8B fine-tuned on MA-Bench-Train shows clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundational benchmark for advancing MLLMs in understanding subtle micro-actions and human-related behaviors. Project Page: https://MA-Bench.github.io
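To make the three-tier evaluation concrete, the following is a minimal sketch of how per-tier accuracy could be computed over structured question-answer pairs. It is not the authors' evaluation code. The record schema (`tier`, `answer`, `prediction`), the tier identifiers, and the assumption that answers are short labels compared by exact match are all hypothetical, and the toy records are invented for illustration rather than taken from MA-Bench.

```python
from collections import defaultdict

# Hypothetical tier identifiers, named after the three tiers in the abstract.
TIERS = ("perception", "relational_comprehension", "interpretive_reasoning")


def per_tier_accuracy(records):
    """Return {tier: accuracy} for the tiers that appear in `records`.

    Each record is assumed (hypothetically) to be a dict with the keys
    'tier', 'answer', and 'prediction', where the last two are short
    answer labels compared case-insensitively after stripping whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        tier = rec["tier"]
        total[tier] += 1
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[tier] += 1
    # Skip tiers with no records to avoid dividing by zero.
    return {tier: correct[tier] / total[tier] for tier in TIERS if total[tier]}


# Toy records, invented for illustration only (not MA-Bench data).
example_records = [
    {"tier": "perception", "answer": "A", "prediction": "A"},
    {"tier": "perception", "answer": "B", "prediction": "C"},
    {"tier": "relational_comprehension", "answer": "D", "prediction": "d "},
    {"tier": "interpretive_reasoning", "answer": "A", "prediction": "B"},
]

example_scores = per_tier_accuracy(example_records)
```

Reporting one score per tier, rather than a single pooled accuracy, keeps a weakness in perception from being hidden by stronger results on the reasoning tiers, or the reverse.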