AI · Jun 2, 2025

RoboEgo System Card: An Omnimodal Model with Native Full Duplexity

arXiv:2506.01934v1 · 4 citations · h-index: 62
Originality: Highly original
AI Analysis

This paper addresses the problem of enabling more natural, real-time human-AI interaction in embodied contexts. It represents a novel advance rather than an incremental improvement.

The paper tackles the challenge of building an AI model that processes multiple modalities (e.g., vision, audio, text) and responds in full duplex, as humans do. It presents RoboEgo, which achieves a theoretical duplex latency of 80 ms. In real-world streaming conversations, RoboEgo shows superior responsiveness and speech naturalness while matching the content quality of state-of-the-art semi-duplex models.

Humans naturally process real-world multimodal information in a full-duplex manner. In artificial intelligence, replicating this capability is essential for advancing model development and deployment, particularly in embodied contexts. The development of multimodal models faces two primary challenges: (1) effectively handling more than three modalities, such as vision, audio, and text; and (2) delivering full-duplex responses to rapidly evolving human instructions. To facilitate research on models that support both omnimodal processing and full duplexity, we present RoboEgo (alias: FLM-Ego), a unified model system designed to address both challenges. RoboEgo incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms. In streaming visually grounded conversations under real-world conditions, RoboEgo exhibits superior responsiveness and speech naturalness, while maintaining comparable content qualities to state-of-the-art semi-duplex omnimodal models, a feat previously considered unattainable by native full-duplex systems.
