CV · May 11

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

arXiv:2605.0999696.61 citations · Has Code
Predicted impact top 6% in CV · last 90 days · Originality: Highly original
AI Analysis

Provides a diagnostic framework and benchmark for evaluating and improving omnimodal personalization, highlighting critical grounding and calibration issues for researchers developing multimodal personalization systems.

Omni-Persona introduces the first comprehensive benchmark for omnimodal personalization, covering text, image, and audio across 18 tasks. Key findings include a consistent audio-vs-visual grounding gap in open-source models, the inadequacy of answerable recall and parameter scale as sole diagnostics, and that RLVR generalizes better than SFT but can lead to conservative behavior.

While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language. Unified omnimodal benchmarking that jointly covers text, image, and audio is still limited, and existing work lacks the methodological rigor to account for absent-persona scenarios or to support systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across ~750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy (Cal), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. In our experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher Cal, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
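The abstract does not give the formula for Calibrated Accuracy, so the following is only a minimal sketch of the idea it describes: a score that credits a correct answer when the persona is present and credits abstention when the persona is absent. The `Item` class, the use of `None` to mark both "persona absent" and "model abstained", and the simple fraction-correct scoring are all assumptions made for illustration, not the paper's definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    # Hypothetical record: gold is None when the queried persona is absent.
    gold: Optional[str]
    # Hypothetical record: prediction is None when the model abstains.
    prediction: Optional[str]


def answerable_recall(items: List[Item]) -> float:
    """Fraction of answerable items (gold present) answered correctly."""
    answerable = [it for it in items if it.gold is not None]
    if not answerable:
        return 0.0
    return sum(it.prediction == it.gold for it in answerable) / len(answerable)


def calibrated_accuracy(items: List[Item]) -> float:
    """Assumed scoring: an item counts as correct if the prediction matches
    the gold answer, where abstaining (None) on an absent-persona item
    (gold None) also counts as a match."""
    if not items:
        return 0.0
    return sum(it.prediction == it.gold for it in items) / len(items)


if __name__ == "__main__":
    items = [
        Item(gold="alice", prediction="alice"),  # answerable, correct
        Item(gold="bob", prediction="bob"),      # answerable, correct
        Item(gold=None, prediction="carol"),     # absent persona, hallucinated
        Item(gold=None, prediction="dave"),      # absent persona, hallucinated
    ]
    print(answerable_recall(items))    # 1.0
    print(calibrated_accuracy(items))  # 0.5
```

In this toy case the model has perfect answerable recall yet scores only 0.5 under the assumed calibrated metric, which mirrors finding (ii): strong recall can coexist with absent-persona hallucination.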

