cs.MM Computer Science

Multimedia

Multimedia systems, content analysis

46.9 CV Jun 1 Code 11k

Cosmos 3: Omnimodal World Models for Physical AI

Aditi, Niket Agarwal, Arslan Ali et al.

This work provides a scalable, general-purpose backbone for embodied agents by unifying multiple modalities into a single framework, which is a significant step for Physical AI research.

23.5 CV Mar 16 Code 23

EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing

Zitong Xu, Huiyu Duan, Zhongpeng Ji et al.

This work addresses a bottleneck in scalable human feedback for image editing, enabling better evaluation and optimization of editing models, though it is incremental in building upon existing MLLM and reinforcement learning techniques.

19.0 CL May 27 Code 1k

Rethinking Memory as Continuously Evolving Connectivity

Jizhan Fang, Buqiang Xu, Zhixian Wang et al.

For LLM agents operating in dynamic environments, FluxMem addresses the brittleness of static memory by enabling adaptive connectivity evolution, leading to consistent SOTA results across diverse benchmarks.

20.8 CV Apr 10 Code

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

Junchao Liao, Zhenghao Zhang, Xiangyu Meng et al.

This work addresses the challenge of physical coherence in audio-video generation for applications like media production, though it is incremental in that it extends existing methods with a novel kinematic prior.

33.3 CV Apr 9 Code 97

A Survey on 3D Gaussian Splatting

Guikun Chen, Wenguan Wang

This survey addresses the need for a comprehensive overview of this emerging method for researchers in computer graphics and vision, but it is incremental in that it reviews existing developments rather than introducing new findings.

18.2 CV Mar 12 Code 65

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Quanhao Li, Zhen Xing, Rui Wang et al.

This work addresses efficiency and quality issues in video generation for applications requiring precise motion control, though it is incremental as it builds on existing adapter and distillation techniques.

18.2 SD Mar 12

Audio-Language Models for Audio-Centric Tasks: A Systematic Survey

Yi Su, Jisheng Bai, Qisheng Xu et al.

This is an incremental survey that helps researchers and practitioners in audio-centric AI by summarizing existing technologies and providing references for practical applications.

35.3 CV Apr 22 Code

Building a Precise Video Language with Human-AI Oversight

Zhiqiu Lin, Chancharik Mitra, Siyuan Cen et al.

For researchers and practitioners in video understanding and generation, this work provides a scalable method to produce high-quality, structured captions that improve both VLM performance and text-to-video generation control.

20.4 SD Jun 3

Audio Interaction Model

Zhifei Xie, Zihang Liu, Ze An et al.

This work addresses the need for a single model that can handle multiple streaming audio tasks (e.g., voice chatting, ASR) in real time, unifying capabilities that were previously separate.

16.9 CV Jun 5 Code 26

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu et al.

For researchers in video understanding and MLLMs, this survey offers a structured framework to categorize and compare approaches, but it is an incremental synthesis of existing work rather than a novel contribution.

17.6 CV Mar 25

AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem et al.

This work addresses the need of researchers and practitioners for modular, efficient control in audio-visual generation, offering a significant improvement over monolithic or costly methods.

16.2 CV Apr 29 Code 6

MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

Shuzhao Xie, Junchen Ge, Weixiang Zhang et al.

For practitioners deploying 3DGS in storage-constrained environments, this method provides a size-aware codec that accurately meets target budgets without retraining.

25.1 CV Apr 9

LPM 1.0: Video-based Character Performance Model

Ailing Zeng, Casper Yang, Chauncey Ge et al.

This work addresses the problem of creating lifelike virtual characters for conversational agents, live streaming, and games, and presents a novel method rather than an incremental improvement.

15.0 CV May 17

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Yuheng Chen, Qingdong He, Teng Hu et al.

This work addresses the underexplored problem of multimodal customization for simultaneous identity preservation in audio-video generation, offering a solution for content creators needing consistent character voices and appearances.

14.6 CV Mar 20

EgoForge: Goal-Directed Egocentric World Simulator

Yifan Shen, Jiateng Liu, Xinzhuo Li et al.

This work addresses the problem of simulating dynamic egocentric environments for applications like smart glasses, though it is incremental in that it builds on existing generative world models.

25.8 AI Apr 27

Co-Director: Agentic Generative Video Storytelling

Yale Song, Yiwen Song, Nick Losier et al.

For AI video generation, Co-Director addresses semantic drift in agentic pipelines, offering a principled optimization approach that generalizes to cinematic narratives.

10.5 CL Mar 18 Code

Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

Ziyi He, Yushi Feng, Shuangyu Yang et al.

This work addresses the safety-critical need for better multimodal AI in dental clinical routing, though it is incremental as it introduces a new benchmark without a novel method.

15.8 CR Mar 23 Code

Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee

This work addresses a safety vulnerability in MLLMs that affects users relying on visual reasoning, though it is incremental in that it builds on existing jailbreak benchmarks.

17.3 MM Mar 12

OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Yaofeng Su, Yuming Li, Zeyue Xue et al.

This work enables real-time joint audio-visual generation, addressing a bottleneck in multimodal AI systems, though it is incremental in that it builds on existing diffusion models.

14.7 CV Mar 19

EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

Yuzhe Weng, Haotian Wang, Yuanhong Yu et al.

This work addresses the problem of generating realistic and controllable talking head videos for applications like virtual avatars and video editing, representing a novel paradigm rather than an incremental improvement.