Ailing Zeng, Casper Yang, Chauncey Ge et al.
This work addresses the problem of creating lifelike virtual characters for conversational agents, live streaming, and games, and it presents a novel method rather than an incremental improvement.
Multimedia systems, content analysis
Zhiqiu Lin, Chancharik Mitra, Siyuan Cen et al.
For researchers and practitioners in video understanding and generation, this work provides a scalable method to produce high-quality, structured captions that improve both VLM performance and text-to-video generation control.
Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem et al.
For researchers and practitioners, this work addresses the need for modular and efficient control in audio-visual generation, offering a significant improvement over monolithic or costly methods.
Zeyue Tian, Binxin Yang, Zhaoyang Liu et al.
This work addresses the lack of a unified framework for audio generation, editing, and understanding, providing a versatile solution that matches specialized models across multiple domains.
Jizhan Fang, Buqiang Xu, Zhixian Wang et al.
For LLM agents operating in dynamic environments, FluxMem addresses the brittleness of static memory by enabling adaptive connectivity evolution, leading to consistent state-of-the-art results across diverse benchmarks.
Yueqian Lin, Jingyang Zhang, Qinsi Wang et al.
For researchers in multimodal AI, this work addresses temporal integration and cross-modal association in computational systems, though it appears incremental because it builds on known hippocampal mechanisms.
Aditi, Niket Agarwal, Arslan Ali et al.
This work provides a scalable, general-purpose backbone for embodied agents by unifying multiple modalities into a single framework, which is a significant step for Physical AI research.
Qi Cai, Jingwen Chen, Chengmin Gao et al.
This work provides a scalable, end-to-end unified architecture for multimodal image generation and editing, potentially simplifying and advancing the field of visual generative AI.
Shuang Chen, Quanxin Shou, Hangting Chen et al.
This work addresses the challenge of real-world image generation involving culturally significant and long-tail factual concepts for applications requiring external knowledge grounding, representing an early exploration of agent-based modeling in this domain.
Yuxuan He, Chaiming Huang, Yifan Wu et al.
This work addresses the need for better structural understanding in video AI for applications such as editing and recommendation, and it offers a novel method for a known bottleneck.
Yuxuan Bian, Zeyue Xue, Songchun Zhang et al.
This work addresses the fundamental memory bottleneck in infinite video generation, offering a practical path toward real-time, unbounded-length video synthesis for content creation and simulation applications.
Weitong Cai, Hang Zhang, Yukai Huang et al.
This enables practical always-on video sensing on resource-constrained mobile and edge devices by reducing sensing and inference costs.
Kangan Qian, ChuChu Xie, Yang Zhong et al.
This work addresses the lack of geometric reasoning in cloud-based VLMs for embodied AI, enabling more accurate scene understanding and generalization in large-scale environments.
Xiaomin Yu, Yi Xin, Yuhui Zhang et al.
For researchers working on multimodal large language models, this work offers a method to reduce reliance on costly paired data while improving alignment, though the gains are incremental over existing approaches.
Bernini Team, Chenchen Liu, Junyi Chen et al.
This work addresses the need for semantically grounded video generation and editing by combining the reasoning capabilities of MLLMs with the fidelity of diffusion models, offering a unified framework that outperforms existing methods.
Junchao Liao, Zhenghao Zhang, Xiangyu Meng et al.
This work addresses the challenge of physical coherence in audio-video generation for applications such as media production; it is incremental, building on existing methods by adding a novel kinematic prior.
Zachary Novack, Stephen Brade, Haven Kim et al.
This work addresses the computational bottleneck of using diffusion models for interactive music generation, making them practical for live performance on consumer hardware, which was previously only feasible with industrial-scale compute.
Zijun Cui, Xiulong Liu, Hao Fang et al.
For researchers and developers of joint audio-video generation models, this work identifies cross-modal physical consistency and transition dynamics as critical open challenges.
Kejuan Yang, Yizhuo Zhang, Mingyuan Du et al.
For industrial-scale video moderation, UNIVID provides an interpretable and efficient alternative to fragmented black-box classifiers, reducing maintenance overhead and improving moderation accuracy.
Ke Li, Maoliang Li, Jialiang Chen et al.
This work addresses the challenge of automatically creating professional-grade video mashups; it is incremental, building on existing editing frameworks with a focus on multimodal orchestration.