CV · Feb 12

Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

arXiv:2602.12221v1 · h-index: 91
Originality: Incremental advance
AI Analysis

This work addresses multimodal AI challenges for applications that require joint reasoning and generation, though it appears incremental because it builds on existing flow-matching and adapter techniques.

The paper tackles the problem of multimodal understanding, generation, and editing by proposing UniDFlow, a unified discrete flow-matching framework that decouples tasks via adapters and uses reference-based alignment for improved faithfulness and controllability. It achieves state-of-the-art performance across eight benchmarks and demonstrates strong zero-shot generalization to various tasks without explicit training.

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
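
To make the idea of task-specific low-rank adapters concrete, here is a minimal sketch of a generic LoRA-style layer in which one frozen weight matrix is shared and each task gets its own low-rank update. It is not UniDFlow's implementation: the abstract does not say where the adapters attach or how they are initialized, so the class name `TaskLoRALinear`, the helper `matvec`, the task names, the rank, and the initialization scale are all assumptions chosen for illustration.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class TaskLoRALinear:
    """Hypothetical layer: frozen shared weight W plus one low-rank
    update B_t @ A_t per task, so tasks never share adapter parameters."""

    def __init__(self, d_in, d_out, rank, tasks, seed=0):
        rng = random.Random(seed)
        # Shared weight; in a real setup this would be pretrained and frozen.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.adapters = {}
        for task in tasks:
            # A projects down to `rank` dims; B projects back up to d_out.
            A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
            # Zero-initialized B makes each adapter a no-op before training.
            B = [[0.0] * rank for _ in range(d_out)]
            self.adapters[task] = (A, B)

    def forward(self, x, task):
        """Compute W x + B_task (A_task x) for the selected task."""
        A, B = self.adapters[task]
        base = matvec(self.W, x)
        delta = matvec(B, matvec(A, x))
        return [b + d for b, d in zip(base, delta)]
```

Calling `forward(x, "understanding")` and `forward(x, "generation")` on a layer built with those two task names routes the same input through different adapters. Because only the selected task's `A` and `B` enter the computation, an update to one task's adapter cannot change the other task's output, which is the kind of decoupling the abstract describes.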

