Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision
This work addresses alignment issues in multimodal AI models. It is an incremental improvement to the performance of unified multimodal models (UMMs) on understanding and generation tasks.
The paper tackles granularity mismatch and supervisory redundancy in Unified Multimodal Models (UMMs) by introducing Semantically-Grounded Supervision (SeGroS), a fine-tuning framework. On the GenEval, DPGBench, and CompBench benchmarks, SeGroS improves generation fidelity and cross-modal alignment across various UMM architectures.
Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations, notably granularity mismatch and supervisory redundancy. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve both. At its core is a novel visual grounding map, from which we construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
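The idea of restricting the reconstruction loss to core text-aligned regions can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the keep_ratio parameter, and the top-k rule for choosing "core" patches are all assumptions made for illustration, and the grounding scores are toy values rather than the output of a real grounding map.

```python
import numpy as np


def select_core_regions(grounding_map, keep_ratio=0.5):
    """Return a boolean mask over patches with the highest grounding scores.

    Hypothetical rule: keep the top `keep_ratio` fraction of patches.
    The abstract does not specify how core regions are chosen.
    """
    flat = np.asarray(grounding_map, dtype=float).ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    # Stable sort on negated scores gives the k highest-scoring patches.
    top_idx = np.argsort(-flat, kind="stable")[:k]
    mask = np.zeros(flat.size, dtype=bool)
    mask[top_idx] = True
    return mask


def restricted_reconstruction_loss(per_patch_loss, core_mask):
    """Average the per-patch loss over the selected core patches only."""
    per_patch_loss = np.asarray(per_patch_loss, dtype=float).ravel()
    return float(per_patch_loss[core_mask].mean())


# Toy example: four image patches with assumed text-alignment scores
# and assumed per-patch reconstruction losses.
grounding_map = np.array([0.9, 0.1, 0.7, 0.2])
per_patch_loss = np.array([2.0, 5.0, 4.0, 1.0])

core_mask = select_core_regions(grounding_map, keep_ratio=0.5)
loss = restricted_reconstruction_loss(per_patch_loss, core_mask)
print(core_mask, loss)
```

In this toy setup only the first and third patches contribute to the loss, so weakly text-aligned patches no longer add redundant supervision.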