CV | Dec 2, 2024

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

arXiv:2412.01800v1 | 14 citations | h-index: 15 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of physical commonsense in video LLMs and is most relevant to researchers and developers working on AI and video understanding. It is an incremental advance, since it builds on existing video LLM frameworks.

The paper tackles the evaluation of physical commonsense understanding in video large language models. It introduces PhysGame, a benchmark of 880 gameplay videos containing glitches across four domains, and finds that current open-source models lag significantly behind proprietary ones. The authors also propose PhysVLM, a model enhanced with physical knowledge, which achieves state-of-the-art performance on PhysGame and on general video understanding benchmarks.

Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and across 12 distinct physical commonsense categories. Through extensively evaluating various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction tuning dataset PhysInstruct with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we also propose a preference optimization dataset PhysDPO with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta information hacking), fewer frames (i.e., temporal hacking) and lower spatial resolutions (i.e., spatial hacking). Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both physical-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.
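The three "hacking" strategies described for PhysDPO can be illustrated with a small sketch. This is not the authors' pipeline: the names `VideoInput`, `meta_hack`, `temporal_hack`, `spatial_hack`, `build_preference_pairs`, and the `answer_fn` callback (standing in for whatever model generates a response) are all hypothetical, and frames are modelled as plain nested lists so that the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical frame representation: rows of pixel intensities.
Frame = List[List[int]]


@dataclass
class VideoInput:
    title: str            # meta information shown to the model
    frames: List[Frame]   # decoded video frames


def meta_hack(video: VideoInput, misleading_title: str) -> VideoInput:
    """Meta information hacking: swap in a misleading title."""
    return VideoInput(title=misleading_title, frames=video.frames)


def temporal_hack(video: VideoInput, keep_every: int) -> VideoInput:
    """Temporal hacking: keep only every `keep_every`-th frame."""
    return VideoInput(title=video.title, frames=video.frames[::keep_every])


def spatial_hack(video: VideoInput, factor: int) -> VideoInput:
    """Spatial hacking: lower resolution by striding rows and columns."""
    small = [[row[::factor] for row in frame[::factor]] for frame in video.frames]
    return VideoInput(title=video.title, frames=small)


def build_preference_pairs(
    video: VideoInput,
    question: str,
    answer_fn: Callable[[VideoInput, str], str],  # assumed response generator
    misleading_title: str,
    keep_every: int = 4,   # illustrative default, not from the paper
    factor: int = 2,       # illustrative default, not from the paper
) -> List[Dict[str, str]]:
    """Pair the answer on the clean input (chosen) with answers
    conditioned on each degraded input (rejected)."""
    chosen = answer_fn(video, question)
    degraded = {
        "meta": meta_hack(video, misleading_title),
        "temporal": temporal_hack(video, keep_every),
        "spatial": spatial_hack(video, factor),
    }
    return [
        {
            "hack": name,
            "prompt": question,
            "chosen": chosen,
            "rejected": answer_fn(bad_video, question),
        }
        for name, bad_video in degraded.items()
    ]
```

Each returned record has the prompt/chosen/rejected shape that preference-optimization trainers commonly expect. The assumption here is that a response generated from a degraded input is more likely to miss the glitch and can therefore serve as the dis-preferred answer.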

Code Implementations: 1 repo