CV · Feb 10

SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem

Ziqiang Shi, Rujie Liu, Shanshan Yu, Satoshi Munakata, Koichi Shirahata

arXiv:2602.09528v11.5h-index: 5

Originality: Highly original

AI Analysis

The paper addresses hallucinations in MLLMs, a known bottleneck that limits their use in high-stakes applications such as healthcare, and proposes a novel method to mitigate them.

The paper tackles hallucinations in Multimodal Large Language Models (MLLMs) by proposing SchröMind, a framework that solves the Schrödinger bridge problem to map hallucinatory to truthful activations, achieving state-of-the-art performance on POPE and MME benchmarks with minimal computational overhead.

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework that reduces hallucinations by solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model's original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.
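
For orientation, the classical Schrödinger bridge problem that the abstract refers to can be written as below. This is the textbook formulation, not necessarily the exact objective used in the paper. The symbols $\mu_{\mathrm{hal}}$ and $\mu_{\mathrm{true}}$ (distributions of hallucinatory and truthful activations), the reference process $W^{\varepsilon}$ (Brownian motion with variance $\varepsilon$), and the dimension $d$ are notation assumed here for illustration.

```latex
% Dynamic form: among all path measures P with the prescribed endpoint
% marginals, find the one closest in KL divergence to the reference W^eps.
\begin{equation}
  P^{\star} \;=\; \operatorname*{arg\,min}_{P \,:\, P_0 = \mu_{\mathrm{hal}},\; P_1 = \mu_{\mathrm{true}}}
  \mathrm{KL}\!\left(P \,\middle\|\, W^{\varepsilon}\right)
\end{equation}

% Static form (entropy-regularized optimal transport): under mild regularity
% assumptions this is equivalent to the dynamic form, up to constants that
% depend only on the two marginals.
\begin{equation}
  \pi^{\star} \;=\; \operatorname*{arg\,min}_{\pi \in \Pi(\mu_{\mathrm{hal}},\, \mu_{\mathrm{true}})}
  \int \tfrac{1}{2}\,\lVert x - y \rVert^{2}\, \mathrm{d}\pi(x, y)
  \;+\; \varepsilon\, \mathrm{KL}\!\left(\pi \,\middle\|\, \mu_{\mathrm{hal}} \otimes \mu_{\mathrm{true}}\right)
\end{equation}
```

Here $\Pi(\mu_{\mathrm{hal}}, \mu_{\mathrm{true}})$ denotes the set of couplings with the two given marginals. The quadratic term is the transport cost that the abstract describes as minimal, and $\varepsilon$ controls how stochastic the bridge is: as $\varepsilon \to 0$ the static problem approaches standard quadratic-cost optimal transport.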
