Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment
This work addresses the challenge of semantic and temporal alignment in multi-modal learning for applications like video understanding, though it appears incremental as it builds on existing text-based unification methods.
The paper tackles inconsistencies in multi-modal learning by proposing UMaT, a framework that unifies visual and auditory inputs as structured text for large language models. It reports significant gains in Long Video Question Answering accuracy: up to 13.7%, and up to 16.9% on long videos.
While multi-modal learning has advanced significantly, current approaches often introduce inconsistencies in how different modalities are represented and reasoned over. We propose UMaT, a theoretically grounded framework that unifies visual and auditory inputs as structured text for large language models, addressing semantic alignment, temporal synchronization, and efficient sparse information retrieval. Through redundancy minimization and a structured textual representation for unified multi-modal reasoning, it significantly improves on state-of-the-art Long Video Question Answering accuracy (by up to 13.7%, and by up to 16.9% on long videos).
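The general idea of rendering both modalities as time-aligned text can be illustrated with a minimal sketch. This is not the UMaT implementation: the `Event` record, the fixed 10-second window, and the rule of dropping a caption that repeats the previous one from the same modality are all assumptions made here for illustration, standing in for the paper's temporal synchronization and redundancy minimization steps.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical record: one caption or transcript line with its time span."""
    start: float   # seconds
    end: float     # seconds
    modality: str  # e.g. "visual" or "audio"
    text: str


def to_structured_text(events, window=10.0):
    """Group events into fixed time windows and emit one text block per window.

    Assumed redundancy rule: skip an event whose text is identical to the
    most recent kept text from the same modality.
    """
    buckets = {}
    for ev in sorted(events, key=lambda e: (e.start, e.modality)):
        idx = int(ev.start // window)
        buckets.setdefault(idx, []).append(ev)

    lines = []
    last_text = {}  # most recent kept text per modality
    for idx in sorted(buckets):
        kept = []
        for ev in buckets[idx]:
            if last_text.get(ev.modality) == ev.text:
                continue  # redundant with the previous segment
            last_text[ev.modality] = ev.text
            kept.append(ev)
        if not kept:
            continue
        lines.append(f"[{idx * window:.0f}s-{(idx + 1) * window:.0f}s]")
        for ev in kept:
            lines.append(f"  {ev.modality}: {ev.text}")
    return "\n".join(lines)


events = [
    Event(0, 5, "visual", "a man opens a door"),
    Event(2, 4, "audio", "hello"),
    Event(10, 15, "visual", "a man opens a door"),
    Event(12, 14, "audio", "goodbye"),
]
structured = to_structured_text(events)
print(structured)
```

In this toy input the second visual caption repeats the first, so the second window keeps only the audio line. The resulting text could then be passed to a language model as context for a question about the video.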