Tianran Wang

2papers

2 Papers

99.1CLApr 30Code
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Junbo Cui, Bokai Xu, Chongyi Wang et al.

Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.

CVJan 11, 2021
Horizontal-to-Vertical Video Conversion

Tun Zhu, Daoxin Zhang, Yao Hu et al.

Alongside the prevalence of mobile videos, the general public leans towards consuming vertical videos on hand-held devices. To revitalize the exposure of horizontal contents, we hereby set forth the exploration of automated horizontal-to-vertical (abbreviated as H2V) video conversion with our proposed H2V framework, accompanied by an accurately annotated H2V-142K dataset. Concretely, H2V framework integrates video shot boundary detection, subject selection and multi-object tracking to facilitate the subject-preserving conversion, wherein the key is subject selection. To achieve so, we propose a Rank-SS module that detects human objects, then selects the subject-to-preserve via exploiting location, appearance, and salient cues. Afterward, the framework automatically crops the video around the subject to produce vertical contents from horizontal sources. To build and evaluate our H2V framework, H2V-142K dataset is densely annotated with subject bounding boxes for 125 videos with 132K frames and 9,500 video covers, upon which we demonstrate superior subject selection performance comparing to traditional salient approaches, and exhibit promising horizontal-to-vertical conversion performance overall. By publicizing this dataset as well as our approach, we wish to pave the way for more valuable endeavors on the horizontal-to-vertical video conversion task.