CV · Dec 11, 2025

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

arXiv:2512.10958v1 · 19 citations · h-index: 13
Originality: Synthesis-oriented
AI Analysis

WorldLens addresses the lack of a unified evaluation for driving world models, a gap that matters to embodied-AI researchers and developers who need generated worlds to behave realistically. The contribution is incremental in that it focuses on benchmarking rather than proposing a new modeling paradigm.

The paper tackles the problem of evaluating generative world models for driving by introducing WorldLens, a benchmark that assesses visual realism, geometric consistency, physical plausibility, and functional reliability, revealing that no existing model excels universally across these dimensions. It includes a dataset of 26K human-annotated videos and an evaluation model to provide scalable, explainable scoring.

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.

