RIV: Recursive Introspection Mask Diffusion Vision Language Model
This work addresses error propagation in multimodal understanding and is aimed at AI researchers; the contribution is incremental, since it builds on existing Mask Diffusion-based Vision Language Model (MDVLM) frameworks.
The paper tackles the lack of self-correction in Mask Diffusion-based Vision Language Models by proposing RIV, which introduces Introspection Training and Recursive Inference to enable error detection and correction, achieving state-of-the-art performance on multiple benchmarks.
Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models cannot correct errors in tokens they have already generated; that is, they lack a self-correction capability. In this paper, we propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The first is Introspection Training, where an Introspection Model is introduced to identify errors within generated sequences. Introspection Training enables the model to detect not only grammatical and spelling mistakes but also, more importantly, logical errors. The second is Recursive Inference. Beginning with the standard unmasking step, the learned Introspection Model helps identify errors in the output sequence, which are then remasked. This alternating ($\text{unmask}\rightarrow\text{introspection}\rightarrow\text{remask}$) process is repeated recursively until reliable results are obtained. Experimental results on multiple benchmarks demonstrate that the proposed RIV achieves state-of-the-art performance, outperforming most existing MDVLMs.
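The alternating unmask, introspection, remask loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `recursive_inference`, `make_toy_unmask`, `make_toy_introspect`, and the `max_rounds` cap are hypothetical, the unmasker fills all masked positions in a single call (a simplification of step-wise diffusion unmasking), and the toy introspection function compares against a known reference, which a learned Introspection Model would not have access to.

```python
from typing import Callable, List, Optional

MASK = None  # hypothetical placeholder for a masked position
Token = Optional[str]


def recursive_inference(
    length: int,
    unmask: Callable[[List[Token]], List[Token]],
    introspect: Callable[[List[Token]], List[bool]],
    max_rounds: int = 8,  # assumed safety cap, not taken from the paper
) -> List[Token]:
    """Alternate unmask -> introspection -> remask until no errors are flagged."""
    seq: List[Token] = [MASK] * length
    for round_idx in range(max_rounds):
        seq = unmask(seq)        # fill every masked position
        flags = introspect(seq)  # True marks a token judged erroneous
        if not any(flags) or round_idx == max_rounds - 1:
            break                # accept the result, or give up at the cap
        # Remask only the flagged tokens; keep the rest fixed.
        seq = [MASK if bad else tok for tok, bad in zip(seq, flags)]
    return seq


def make_toy_unmask(drafts: List[List[str]]) -> Callable[[List[Token]], List[Token]]:
    """Toy stand-in for the model: drafts[i] lists successive guesses for position i."""
    attempts = [0] * len(drafts)

    def unmask(seq: List[Token]) -> List[Token]:
        out = list(seq)
        for i, tok in enumerate(seq):
            if tok is MASK:
                guess_idx = min(attempts[i], len(drafts[i]) - 1)
                out[i] = drafts[i][guess_idx]
                attempts[i] += 1
        return out

    return unmask


def make_toy_introspect(reference: List[str]) -> Callable[[List[Token]], List[bool]]:
    """Toy stand-in for the Introspection Model: flags tokens that differ from a reference."""

    def introspect(seq: List[Token]) -> List[bool]:
        return [tok != ref for tok, ref in zip(seq, reference)]

    return introspect


drafts = [["a"], ["dog", "cat"], ["sit", "sits"]]
reference = ["a", "cat", "sits"]
result = recursive_inference(3, make_toy_unmask(drafts), make_toy_introspect(reference))
print(result)  # ['a', 'cat', 'sits']
```

In this toy run the first pass produces `['a', 'dog', 'sit']`; the two flagged tokens are remasked and regenerated on the second pass, while the token judged correct is left untouched. The cap on rounds is one possible stopping rule; the abstract only states that the process repeats until reliable results are obtained.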