AIFeb 27

Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off

Longyin Zhang, Shuo Sun, Yingxu He, Won Cheng Yi Lewis, Muhammad Huzaifah Bin Md Shahrin, Hardik Bhupendra Sailor, Heng Meng Jeremy Wong, Tarun Kumar Vangani, Yi Ma, Qiongqiong Wang, Minh Duc Pham, Ridong Jiang

arXiv:2602.23730v12.4h-index: 13

Originality Incremental advance

AI Analysis

This work addresses the perception-logic trade-off in multimodal AI for underrepresented regions like Southeast Asia, though it appears incremental as it builds on existing MLLM frameworks with a focus on specific regional adaptations.

The researchers tackled the challenge of integrating robust sensory grounding with complex reasoning in multimodal large language models for underrepresented regions, introducing MERaLiON2-Omni (Alpha), a 10B-parameter model tailored for Southeast Asia, and found that while reasoning boosts performance in abstract tasks, it introduces instability in low-level sensory processing, such as temporal drift in audio and visual over-interpretation.

Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly for underrepresented regions. In this report, we introduce the research preview of MERaLiON2-Omni (Alpha), a 10B-parameter multilingual omni-perception tailored for Southeast Asia (SEA). We present a progressive training pipeline that explicitly decouples and then integrates "System 1" (Perception) and "System 2" (Reasoning) capabilities. First, we establish a robust Perception Backbone by aligning region-specific audio-visual cues (e.g., Singlish code-switching, local cultural landmarks) with a multilingual LLM through orthogonal modality adaptation. Second, to inject cognitive capabilities without large-scale supervision, we propose a cost-effective Generate-Judge-Refine pipeline. By utilizing a Super-LLM to filter hallucinations and resolve conflicts via a consensus mechanism, we synthesize high-quality silver data that transfers textual Chain-of-Thought reasoning to multimodal scenarios. Comprehensive evaluation on our newly introduced SEA-Omni Benchmark Suite reveals an Efficiency-Stability Paradox: while reasoning acts as a non-linear amplifier for abstract tasks (boosting mathematical and instruction-following performance significantly), it introduces instability in low-level sensory processing. Specifically, we identify Temporal Drift in long-context audio, where extended reasoning desynchronizes the model from acoustic timestamps, and Visual Over-interpretation, where logic overrides pixel-level reality. This report details the architecture, the data-efficient training recipe, and a diagnostic analysis of the trade-offs between robust perception and structured reasoning.

View on arXiv PDF

Similar