cs.CV · Computer Science

Computer Vision

Image recognition, object detection, visual understanding

35.5 · CV · Apr 22

Image Generators are Generalist Vision Learners

Valentin Gabeur, Shangbang Long, Songyou Peng et al.

This work suggests a potential paradigm shift in computer vision by positioning generative pretraining as a foundational approach for building generalist vision models that unify generation and understanding tasks.

31.1 · CV · Mar 16 · Code (3k)

Kimodo: Scaling Controllable Human Motion Generation

Davis Rempe, Mathis Petrovich, Ye Yuan et al.

This addresses the need for scalable, high-quality human motion data for applications in robotics, simulation, and entertainment, representing a significant advancement over previous limited datasets.

31.3 · CV · Mar 10 · Code (77)

MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Zongxia Li, Hongyang Du, Chengsong Huang et al.

This work addresses the challenge of self-evolving multimodal models for AI researchers, offering a scalable approach beyond existing two-model paradigms.

32.7 · CV · Mar 13 · Code (307)

Multimodal OCR: Parse Anything from Documents

Handong Zheng, Yumeng Li, Kaile Zhang et al.

This addresses the limitation of conventional OCR systems that ignore graphical elements, enabling more comprehensive document parsing for applications like document reconstruction and multimodal pretraining.

34.5 · CV · Mar 11 · Code (44)

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan, Zhibo Yang, Jianqiang Wan et al.

This work addresses visual perception deficiencies in MLLMs for STEM applications, offering a novel approach that could improve accuracy in domains like science and engineering.

51.1 · CV · Mar 28 · Code (11k)

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu et al.

For researchers and practitioners in computer vision, SAM 3 provides a more accurate and unified model for concept-driven segmentation and tracking, with a new benchmark and dataset.

35.0 · CV · Mar 29 · Code (464)

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Meituan LongCat Team, Bin Xiao, Chao Wang et al.

This work provides a unified approach to multimodal understanding and generation for AI researchers, though it is incremental as it builds on existing next-token prediction (NTP) and tokenization methods.

28.5 · CV · Mar 17

Demystifing Video Reasoning

Ruisi Wang, Zhongang Cai, Fanyi Pu et al.

This provides a systematic account of how reasoning emerges in video generation models, potentially guiding future research that exploits these dynamics.

25.7 · CV · Mar 10 · Code (291)

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang, Guanzhou Chen et al.

This work addresses the challenge of democratizing unified multimodal capabilities for AI applications, though it appears incremental as it builds on existing MLLM and MMDiT-based methods.

30.0 · CV · Mar 18

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Songtao Jiang, Sibo Song, Chenyi Zhou et al.

This addresses the challenge of temporal understanding in video reasoning for vision-language models, offering a more cost-efficient scaling path through synthetic data.

27.9 · CV · Mar 19 · Code (25)

AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Yibo Shi, Jungang Li, Linghao Zhang et al.

This addresses the interaction-memory bottleneck for long-horizon GUI agents, with incremental improvements over existing methods.

29.6 · CV · Mar 16 · Code

HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization

Xuerui Qiu, Yutao Cui, Guozhen Zhang et al.

This addresses the problem of information coherence and optimization conflicts in unified multimodal models for AI researchers, representing a novel method rather than an incremental improvement.

26.3 · CV · Mar 19 · Code

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao, Yuhua Zheng, Jia Xu et al.

This addresses a critical gap for real-world applications where videos are typically long, though it is incremental as it focuses on evaluation rather than model development.

27.6 · CV · Mar 10 · Code (17)

Video-Based Reward Modeling for Computer-Use Agents

Linxin Song, Jieyu Zhang, Huanxin Sheng et al.

This provides a scalable, model-agnostic evaluator for computer-using agents, addressing a key bottleneck in their development and deployment.

28.2 · CV · Mar 13

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Yichen Zhang, Da Peng, Zonghao Guo et al.

This addresses the challenge of mismatched decoding regimes and representations in multimodal AI, offering an efficient solution for unified tasks, though it appears incremental because it builds on existing unified multimodal model (UMM) approaches.

27.1 · CV · Mar 16

Grounding World Simulation Models in a Real-World Metropolis

Junyoung Seo, Hyunwook Choi, Minkyung Kwon et al.

This work addresses the challenge of creating realistic, dynamic simulations of actual urban environments for applications in urban planning, autonomous systems, or virtual reality, representing a novel method for a known bottleneck rather than a foundational breakthrough.

25.4 · CV · Mar 20 · Code (56)

PEARL: Personalized Streaming Video Understanding Model

Yuanhong Zheng, Ruichuan An, Xiaopeng Lin et al.

This addresses a limitation of current personalization methods, which are restricted to static or offline data, with future AI assistants in mind, though it is incremental as it builds on existing vision-language models.

35.5 · CV · Mar 29 · Code (431)

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li et al.

This work addresses the bottleneck of 3D spatial understanding in VLMs for monocular video inputs, offering a scalable solution for embodied AI and time-sensitive applications.

27.6 · CV · Mar 16

MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

Jinguang Tong, Jinbo Wu, Kaisiyuan Wang et al.

This work addresses a frontier in expressive digital human creation for applications like animation and virtual reality, representing a novel method for a known bottleneck rather than an incremental improvement.

29.2 · CV · Apr 14

Lyra 2.0: Explorable Generative 3D Worlds

Tianchang Shen, Sherwin Bahmani, Kai He et al. · nvidia, utoronto

This work tackles the problem of generating large-scale, consistent 3D environments from video for applications in virtual reality and simulation, offering a significant improvement over existing methods that degrade over long trajectories.