cs.MM | Computer Science

Multimedia

Multimedia systems, content analysis

99.5 | CV | Apr 9
LPM 1.0: Video-based Character Performance Model

Ailing Zeng, Casper Yang, Chauncey Ge et al.

This work addresses the problem of creating lifelike virtual characters for conversational agents, live streaming, and games, and presents a novel method rather than an incremental improvement.

99.4 | CV | Apr 22 | Code
Building a Precise Video Language with Human-AI Oversight

Zhiqiu Lin, Chancharik Mitra, Siyuan Cen et al.

For researchers and practitioners in video understanding and generation, this work provides a scalable method to produce high-quality, structured captions that improve both VLM performance and text-to-video generation control.

99.0 | CL | May 27 | Code
Rethinking Memory as Continuously Evolving Connectivity

Jizhan Fang, Buqiang Xu, Zhixian Wang et al.

For LLM agents operating in dynamic environments, FluxMem addresses the brittleness of static memory by letting memory connectivity evolve adaptively, leading to consistent state-of-the-art results across diverse benchmarks.

99.1 | CV | Jun 1 | Code
Cosmos 3: Omnimodal World Models for Physical AI

Aditi, Niket Agarwal, Arslan Ali et al.

This work provides a scalable, general-purpose backbone for embodied agents by unifying multiple modalities into a single framework, which is a significant step for Physical AI research.

98.7 | CV | Apr 1
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Shuang Chen, Quanxin Shou, Hangting Chen et al.

This work addresses the challenge of real-world image generation involving culturally significant and long-tail factual concepts for applications requiring external knowledge grounding, representing an early exploration of agent-based modeling in this domain.

99.1 | MM | Mar 20
Leum-VL Technical Report

Yuxuan He, Chaiming Huang, Yifan Wu et al.

This work addresses the need for better structural understanding in video AI for applications such as editing and recommendation, and presents a novel method for a known bottleneck.

97.9 | CV | May 21
Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen et al.

This work addresses the need for semantically grounded video generation and editing by combining the reasoning capabilities of MLLMs with the fidelity of diffusion models, offering a unified framework that outperforms existing methods.

97.3 | MM | Jun 4 | Code
UNIVID: Unified Vision-Language Model for Video Moderation

Kejuan Yang, Yizhuo Zhang, Mingyuan Du et al.

For industrial-scale video moderation, UNIVID provides an interpretable and efficient alternative to fragmented black-box classifiers, reducing maintenance overhead and improving moderation accuracy.