CV | AI | Apr 20, 2025

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, Gaoang Wang

arXiv:2504.14693v2 | 37 citations | h-index: 21 | Has Code | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Synthesis-oriented

AI Analysis

Video-MMLU gives researchers and developers a new benchmark for assessing LMMs on lecture comprehension, though the contribution is incremental because it builds on existing video understanding benchmarks.

The authors address the problem of evaluating language multimodal models (LMMs) on multi-discipline lecture understanding by introducing Video-MMLU, a massive benchmark. Testing models ranging from 0.5B to 40B parameters, they find that current models are limited on tasks that require both perception and reasoning.

Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the large language models influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
