CV · Jan 26

Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge

arXiv:2601.18698v1 · h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses geographic fairness in AI models, giving researchers and developers insight into bias in globally deployed applications. It is incremental in that it focuses on evaluation rather than proposing new methods.

The paper investigates whether text-to-video generation models, specifically Sora 2, encode geographically equitable visual knowledge by evaluating how well they synthesize tourist attractions from diverse regions. It finds that the model performs relatively uniformly across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity.

Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.
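The abstract's two headline findings (similar scores across regions, weak dependence on popularity) could be checked with a per-group mean comparison and a rank correlation. The sketch below is only an illustration of that kind of analysis, not the paper's GAP implementation: the record fields (`region`, `popularity`, `score`), the helper names, and the toy values are all hypothetical, and `score` stands in for whatever per-attraction metric an evaluator produces.

```python
from collections import defaultdict
from statistics import mean


def rank(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


def region_means(records):
    """Mean score per region (hypothetical record layout)."""
    groups = defaultdict(list)
    for r in records:
        groups[r["region"]].append(r["score"])
    return {region: mean(scores) for region, scores in groups.items()}


def region_gap(records):
    """Spread between the best- and worst-scoring regions; smaller means more uniform."""
    means = region_means(records)
    return max(means.values()) - min(means.values())


# Synthetic toy records for illustration only; these are not results from the paper.
records = [
    {"region": "A", "popularity": 10, "score": 0.6},
    {"region": "A", "popularity": 40, "score": 0.8},
    {"region": "B", "popularity": 20, "score": 0.7},
    {"region": "B", "popularity": 30, "score": 0.9},
]

gap = region_gap(records)
rho = spearman(
    [r["popularity"] for r in records],
    [r["score"] for r in records],
)
```

Under these assumptions, a small `gap` together with a `rho` near zero would be consistent with the kind of uniformity the abstract reports; a real analysis would also need significance tests and far more than four records.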
