CVJan 7, 2025Code
Cosmos World Foundation Model Platform for Physical AINiket Agarwal, Arslan Ali, Maciej Bala et al. · nvidia
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.
CVJun 3Code
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?Rui Zhao, Kaiming Yang, Jifeng Zhu et al.
Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
CVApr 6, 2023Code
$R^{2}$Former: Unified $R$etrieval and $R$eranking Transformer for Place RecognitionSijie Zhu, Linjie Yang, Chen Chen et al.
Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information but ignores other possible information that could be useful for reranking, e.g. local feature correlations, and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named $R^{2}$Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether the image pair is from the same location. The whole pipeline is end-to-end trainable and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, $R^{2}$Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves the state-of-the-art on the hold-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show vision transformer tokens are comparable and sometimes better than CNN local features on local matching. The code is released at https://github.com/Jeff-Zilence/R2Former.
CVOct 28, 2025Code
World Simulation with Video Foundation Models for Physical AIArslan Ali, Junjie Bai, Maciej Bala et al. · nvidia
We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5$\times$ smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
CVApr 12, 2022Code
Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise AffinityWeiyao Wang, Matt Feiszli, Heng Wang et al.
Open-world instance segmentation is the task of grouping pixels into object instances without any pre-determined taxonomy. This is challenging, as state-of-the-art methods rely on explicit class semantics obtained from large labeled datasets, and out-of-domain evaluation performance drops significantly. Here we propose a novel approach for mask proposals, Generic Grouping Networks (GGNs), constructed without semantic supervision. Our approach combines a local measure of pixel affinity with instance-level mask supervision, producing a training regimen designed to make the model as generic as the data diversity allows. We introduce a method for predicting Pairwise Affinities (PA), a learned local relationship between pairs of pixels. PA generalizes very well to unseen categories. From PA we construct a large set of pseudo-ground-truth instance masks; combined with human-annotated instance masks we train GGNs and significantly outperform the SOTA on open-world instance segmentation on various benchmarks including COCO, LVIS, ADE20K, and UVO. Code is available on project website: https://sites.google.com/view/generic-grouping/.
CVJul 22, 2023Code
Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?Cheng-En Wu, Yu Tian, Haichao Yu et al.
Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noises. This intrigues us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conducted extensive experiments to explore this property and find the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. The code is available at https://github.com/CEWu/PTNL.
CVOct 15, 2022Code
PointNeuron: 3D Neuron Reconstruction via Geometry and Topology Learning of Point CloudsRunkai Zhao, Heng Wang, Chaoyi Zhang et al.
Digital neuron reconstruction from 3D microscopy images is an essential technique for investigating brain connectomics and neuron morphology. Existing reconstruction frameworks use convolution-based segmentation networks to partition the neuron from noisy backgrounds before applying the tracing algorithm. The tracing results are sensitive to the raw image quality and segmentation accuracy. In this paper, we propose a novel framework for 3D neuron reconstruction. Our key idea is to use the geometric representation power of the point cloud to better explore the intrinsic structural information of neurons. Our proposed framework adopts one graph convolutional network to predict the neural skeleton points and another one to produce the connectivity of these points. We finally generate the target SWC file through the interpretation of the predicted point coordinates, radius, and connections. Evaluated on the Janelia-Fly dataset from the BigNeuron project, we show that our framework achieves competitive neuron reconstruction performance. Our geometry and topology learning of point clouds could further benefit 3D medical image analysis, such as cardiac surface reconstruction. Our code is available at https://github.com/RunkaiZhao/PointNeuron.
CVNov 29, 2022Code
One is All: Bridging the Gap Between Neural Radiance Fields Architectures with Progressive Volume DistillationShuangkang Fang, Weixin Xu, Heng Wang et al.
Neural Radiance Fields (NeRF) methods have proved effective as compact, high-quality and versatile representations for 3D scenes, and enable downstream tasks such as editing, retrieval, navigation, etc. Various neural architectures are vying for the core structure of NeRF, including the plain Multi-Layer Perceptron (MLP), sparse tensors, low-rank tensors, hashtables and their compositions. Each of these representations has its particular set of trade-offs. For example, the hashtable-based representations admit faster training and rendering but their lack of clear geometric meaning hampers downstream tasks like spatial-relation-aware editing. In this paper, we propose Progressive Volume Distillation (PVD), a systematic distillation method that allows any-to-any conversions between different architectures, including MLP, sparse or low-rank tensors, hashtables and their compositions. PVD consequently empowers downstream applications to optimally adapt the neural representations for the task at hand in a post hoc fashion. The conversions are fast, as distillation is progressively performed on different levels of volume representations, from shallower to deeper. We also employ special treatment of density to deal with its specific numerical instability problem. Empirical evidence is presented to validate our method on the NeRF-Synthetic, LLFF and TanksAndTemples datasets. For example, with PVD, an MLP-based NeRF model can be distilled from a hashtable-based Instant-NGP model at a 10X~20X faster speed than being trained the original NeRF from scratch, while achieving a superior level of synthesis quality. Code is available at https://github.com/megvii-research/AAAI2023-PVD.
CLJun 4
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User ConstraintsJiayu Liu, Cheng Qian, Zhenhailong Wang et al.
Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
CLOct 2, 2023
Resolving Knowledge Conflicts in Large Language ModelsYike Wang, Shangbin Feng, Heng Wang et al. · berkeley, cmu
Large language models (LLMs) often encounter knowledge conflicts, scenarios where discrepancy arises between the internal parametric knowledge of LLMs and non-parametric information provided in the prompt context. In this work we ask what are the desiderata for LLMs when a knowledge conflict arises and whether existing LLMs fulfill them. We posit that LLMs should 1) identify knowledge conflicts, 2) pinpoint conflicting information segments, and 3) provide distinct answers or viewpoints in conflicting scenarios. To this end, we introduce an evaluation framework for simulating contextual knowledge conflicts and quantitatively evaluating to what extent LLMs achieve these goals. It includes diverse and complex situations of knowledge conflict, knowledge from diverse entities and domains, two synthetic conflict creation methods, and settings with progressively increasing difficulty to reflect realistic knowledge conflicts. Extensive experiments with the framework reveal that while LLMs perform well in identifying the existence of knowledge conflicts, they struggle to determine the specific conflicting knowledge and produce a response with distinct answers amidst conflicting information. To address these challenges, we propose new instruction-based approaches that augment LLMs to better achieve the three goals. Further analysis shows that abilities to tackle knowledge conflicts are greatly impacted by factors such as knowledge domain, while generating robust responses to knowledge conflict scenarios remains an open research question.
CVJul 27, 2023Code
Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence GenerationZhiyuan Li, Dongnan Liu, Heng Wang et al.
Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods have faced limitations due to either using mismatched corpora for inaccurate pseudo annotations or relying on resource-intensive pre-training. To alleviate these challenges, we propose a new strategy where the prior knowledge from large pre-trained models (LPMs) is distilled and leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which can efficiently retrieve highly relevant short region descriptions from the mismatching corpora and use them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter and a CLIP guidance objective to enhance contrastive information learning. Experimental results indicate that our method outperforms SOTA captioning models across various settings including zero-shot, unsupervised, semi-supervised, and cross-domain scenarios. Code is available at: https://github.com/Zhiyuan-Li-John/RaPSG.
AIApr 22, 2023Code
Detecting Spoilers in Movie Reviews with External Movie Knowledge and User NetworksHeng Wang, Wenqian Zhang, Yuyang Bai et al.
Online movie review platforms are providing crowdsourced feedback for the film industry and the general public, while spoiler reviews greatly compromise user experience. Although preliminary research efforts were made to automatically identify spoilers, they merely focus on the review content itself, while robust spoiler detection requires putting the review into the context of facts and knowledge regarding movies, user behavior on film review platforms, and more. In light of these challenges, we first curate a large-scale network-based spoiler detection dataset LCS and a comprehensive and up-to-date movie knowledge base UKM. We then propose MVSD, a novel Multi-View Spoiler Detection framework that takes into account the external knowledge about movies and user activities on movie review platforms. Specifically, MVSD constructs three interconnecting heterogeneous information networks to model diverse data sources and their multi-view attributes, while we design and employ a novel heterogeneous graph neural network architecture for spoiler detection as node-level classification. Extensive experiments demonstrate that MVSD advances the state-of-the-art on two spoiler detection datasets, while the introduction of external knowledge and user interactions help ground robust spoiler detection. Our data and code are available at https://github.com/Arthur-Heng/Spoiler-Detection
CVJan 18, 2023
Temporal Perceiving Video-Language Pre-trainingFan Ma, Xiaojie Jin, Heng Wang et al.
Video-Language Pre-training models have recently significantly improved various multi-modal downstream tasks. Previous dominant works mainly adopt contrastive learning to achieve global feature alignment across modalities. However, the local associations between videos and texts are not modeled, restricting the pre-training models' generality, especially for tasks requiring the temporal video boundary for certain query texts. This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment such that the trained model can accurately perceive temporal boundaries in videos given the text description. Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description, and text localization which matches the subset of texts with the video features. To produce temporal boundaries, frame features in several videos are manually merged into a long video sequence that interacts with a text sequence. With the localization task, our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality. Notably, comprehensive experimental results show that our method significantly improves the state-of-the-art performance on various benchmarks, covering text-to-video retrieval, video question answering, video captioning, temporal action localization and temporal moment retrieval. The code will be released soon.
CVAug 15, 2024Code
Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across ModalitiesZhiyuan Li, Heng Wang, Dongnan Liu et al.
Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks including causal reasoning. However, will these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? Whether we can effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these, we introduce MuCR - a novel Multimodal Causal Reasoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs' comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal reasoning compared to their performance in purely textual settings. Additionally, we find that identifying visual cues across images is key to effective cross-modal generalization. Finally, we propose a VcCoT strategy that better highlights visual cues, and our results confirm its efficacy in enhancing multimodal causal reasoning. The project is available at: https://github.com/Zhiyuan-Li-John/MuCR
CVAug 18, 2023
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation ModelsHeng Wang, Jianbo Ma, Santiago Pascual et al.
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually-aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluation on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively.
CVApr 22, 2022
Spatiality-guided Transformer for 3D Dense Captioning on Point CloudsHeng Wang, Chaoyi Zhang, Jianhui Yu et al.
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label of natural language description on visual appearance and spatial relations for each scene object of interest. To detect and describe objects in a scene, following the spirit of neural machine translation, we propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions, where we especially investigate the relative spatiality of objects in 3D scenes and design a spatiality-guided encoder via a token-to-token spatial relation learning objective and an object-centric decoder for precise and spatiality-enhanced object caption generation. Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU, respectively. Our project page with source code and supplementary files is available at https://SpaCap3D.github.io/ .
AIDec 18, 2025
Adaptation of Agentic AIPengcheng Jiang, Jiacheng Lin, Zhiyi Shi et al. · stanford
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
SIJun 9, 2022
TwiBot-22: Towards Graph-Based Twitter Bot DetectionShangbin Feng, Zhaoxuan Tan, Herun Wan et al.
Twitter bot detection has become an increasingly important task to combat misinformation, facilitate social media moderation, and preserve the integrity of the online discourse. State-of-the-art bot detection methods generally leverage the graph structure of the Twitter network, and they exhibit promising performance when confronting novel Twitter bots that traditional methods fail to detect. However, very few of the existing Twitter bot detection datasets are graph-based, and even these few graph-based datasets suffer from limited dataset scale, incomplete graph structure, as well as low annotation quality. In fact, the lack of a large-scale graph-based Twitter bot detection benchmark that addresses these issues has seriously hindered the development and evaluation of novel graph-based bot detection approaches. In this paper, we propose TwiBot-22, a comprehensive graph-based Twitter bot detection benchmark that presents the largest dataset to date, provides diversified entities and relations on the Twitter network, and has considerably better annotation quality than existing datasets. In addition, we re-implement 35 representative Twitter bot detection baselines and evaluate them on 9 datasets, including TwiBot-22, to promote a fair comparison of model performance and a holistic understanding of research progress. To facilitate further research, we consolidate all implemented codes and datasets into the TwiBot-22 evaluation framework, where researchers could consistently evaluate new models and datasets. The TwiBot-22 Twitter bot detection benchmark and evaluation framework are publicly available at https://twibot22.github.io/
CVMar 25, 2023
PAniC-3D: Stylized Single-view 3D Reconstruction from Portraits of Anime CharactersShuhong Chen, Kevin Zhang, Yichun Shi et al.
We propose PAniC-3D, a system to reconstruct stylized 3D character heads directly from illustrated (p)ortraits of (ani)me (c)haracters. Our anime-style domain poses unique challenges to single-view reconstruction; compared to natural images of human heads, character portrait illustrations have hair and accessories with more complex and diverse geometry, and are shaded with non-photorealistic contour lines. In addition, there is a lack of both 3D model and portrait illustration data suitable to train and evaluate this ambiguous stylized reconstruction task. Facing these challenges, our proposed PAniC-3D architecture crosses the illustration-to-3D domain gap with a line-filling model, and represents sophisticated geometries with a volumetric radiance field. We train our system with two large new datasets (11.2k Vroid 3D models, 1k Vtuber portrait illustrations), and evaluate on a novel AnimeRecon benchmark of illustration-to-3D pairs. PAniC-3D significantly outperforms baseline methods, and provides data to establish the task of stylized reconstruction from portrait illustrations.
CVApr 8, 2023
Progressive Volume Distillation with Active Learning for Efficient NeRF Architecture ConversionShuangkang Fang, Yufeng Wang, Yi Yang et al.
Neural Radiance Fields (NeRF) have been widely adopted as practical and versatile representations for 3D scenes, facilitating various downstream tasks. However, different architectures, including the plain Multi-Layer Perceptron (MLP), Tensors, low-rank Tensors, Hashtables, and their combinations, entail distinct trade-offs. For instance, representations based on Hashtables enable faster rendering but lack clear geometric meaning, thereby posing challenges for spatial-relation-aware editing. To address this limitation and maximize the potential of each architecture, we propose Progressive Volume Distillation with Active Learning (PVD-AL), a systematic distillation method that enables any-to-any conversion between diverse architectures. PVD-AL decomposes each structure into two parts and progressively performs distillation from shallower to deeper volume representation, leveraging effective information retrieved from the rendering process. Additionally, a three-level active learning technique provides continuous feedback from teacher to student during the distillation process, achieving high-performance outcomes. Experimental evidence showcases the effectiveness of our method across multiple benchmark datasets. For instance, PVD-AL can distill an MLP-based model from a Hashtables-based model at a 10~20X faster speed and 0.8dB~2dB higher PSNR than training the MLP-based model from scratch. Moreover, PVD-AL permits the fusion of diverse features among distinct structures, enabling models with multiple editing properties and providing a more efficient model to meet real-time requirements like mobile devices. Project website: https://sk-fun.fun/PVD-AL.
CVMar 9, 2023
Open-world Instance Segmentation: Top-down Learning with Bottom-up SupervisionTarun Kalluri, Weiyao Wang, Heng Wang et al.
Many top-down architectures for instance segmentation achieve significant success when trained and tested on pre-defined closed-world taxonomy. However, when deployed in the open world, they exhibit notable bias towards seen classes and suffer from significant performance drop. In this work, we propose a novel approach for open world instance segmentation called bottom-Up and top-Down Open-world Segmentation (UDOS) that combines classical bottom-up segmentation algorithms within a top-down learning framework. UDOS first predicts parts of objects using a top-down network trained with weak supervision from bottom-up segmentations. The bottom-up segmentations are class-agnostic and do not overfit to specific taxonomies. The part-masks are then fed into affinity-based grouping and refinement modules to predict robust instance-level segmentations. UDOS enjoys both the speed and efficiency from the top-down architectures and the generalization ability to unseen categories from bottom-up supervision. We validate the strengths of UDOS on multiple cross-category as well as cross-dataset transfer tasks from 5 challenging datasets including MS-COCO, LVIS, ADE20k, UVO and OpenImages, achieving significant improvements over state-of-the-art across the board. Our code and models are available on our project page.
CVNov 2, 2023
GPT-4V(ision) as a Generalist Evaluator for Vision-Language TasksXinlu Zhang, Yujie Lu, Weizhi Wang et al.
Automatically evaluating vision-language tasks is challenging, especially when it comes to reflecting human judgments due to limitations in accounting for fine-grained details. Although GPT-4V has shown promising results in various multi-modal tasks, leveraging GPT-4V as a generalist evaluator for these tasks has not yet been systematically explored. We comprehensively validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translations and multi-images to text alignment. We employ two evaluation methods, single-answer grading and pairwise comparison, using GPT-4V. Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators. Despite limitations like restricted visual clarity grading and real-world complex reasoning, its ability to provide human-aligned scores enriched with detailed explanations is promising for universal automatic evaluator.
CLFeb 3Code
Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented GenerationJiashuo Sun, Pengcheng Jiang, Saizhuo Wang et al.
Retrieval-Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator's Goldilocks Zone -- evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly avaliable at https://github.com/GasolSun36/BAR-RAG.
CVSep 24, 2023
Advancements in 3D Lane Detection Using LiDAR Point Clouds: From Data Collection to Model DevelopmentRunkai Zhao, Yuwen Heng, Heng Wang et al.
Advanced Driver-Assistance Systems (ADAS) have successfully integrated learning-based techniques into vehicle perception and decision-making. However, their application in 3D lane detection for effective driving environment perception is hindered by the lack of comprehensive LiDAR datasets. The sparse nature of LiDAR point cloud data prevents an efficient manual annotation process. To solve this problem, we present LiSV-3DLane, a large-scale 3D lane dataset that comprises 20k frames of surround-view LiDAR point clouds with enriched semantic annotation. Unlike existing datasets confined to a frontal perspective, LiSV-3DLane provides a full 360-degree spatial panorama around the ego vehicle, capturing complex lane patterns in both urban and highway environments. We leverage the geometric traits of lane lines and the intrinsic spatial attributes of LiDAR data to design a simple yet effective automatic annotation pipeline for generating finer lane labels. To propel future research, we propose a novel LiDAR-based 3D lane detection model, LiLaDet, incorporating the spatial geometry learning of the LiDAR point cloud into Bird's Eye View (BEV) based lane identification. Experimental results indicate that LiLaDet outperforms existing camera- and LiDAR-based approaches in the 3D lane detection task on the K-Lane dataset and our LiSV-3DLane.
CVJun 21, 2023
Exploring the Role of Audio in Video CaptioningYuhan Shen, Linjie Yang, Longyin Wen et al.
Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches have achieved significant improvement, the audio modality is often ignored in video captioning. In this work, we present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. Instead of relying on text transcripts extracted via automatic speech recognition (ASR), we argue that learning with raw audio signals can be more beneficial, as audio has additional information including acoustic events, speaker identity, etc. Our contributions are twofold. First, we observed that the model overspecializes to the audio modality when pre-training with both video and audio modality, since the ground truth (i.e., text transcripts) can be solely predicted using audio. We proposed a Modality Balanced Pre-training (MBP) loss to mitigate this issue and significantly improve the performance on downstream tasks. Second, we slice and dice different design choices of the cross-modal module, which may become an information bottleneck and generate inferior results. We proposed new local-global fusion mechanisms to improve information exchange across audio and video. We demonstrate significant improvements by leveraging the audio modality on four datasets, and even outperform the state of the art on some metrics without relying on the text modality as the input.
MMSep 11, 2024
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music VideosYan-Bo Lin, Yu Tian, Linjie Yang et al.
We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
CVSep 14, 2023
Dataset Condensation via Generative ModelDavid Junhao Zhang, Heng Wang, Chuhui Xue et al.
Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into the pixels format. However, it suffers from slow optimization speed and large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation methods from scaling up to large datasets with diverse classes. Moreover, the relations among condensed samples have been neglected and hence the feature distribution of condensed samples is often not diverse. To solve these problems, we propose to condense the dataset into another format, a generative model. Such a novel format allows for the condensation of large datasets because the size of the generative model remains relatively stable as the number of classes or image resolution increases. Furthermore, an intra-class and an inter-class loss are proposed to model the relation of condensed samples. Intra-class loss aims to create more diverse samples for each class by pushing each sample away from the others of the same class. Meanwhile, inter-class loss increases the discriminability of samples by widening the gap between the centers of different classes. Extensive comparisons with state-of-the-art methods and our ablation studies confirm the effectiveness of our method and its individual component. To our best knowledge, we are the first to successfully conduct condensation on ImageNet-1k.
CVSep 30, 2024Code
DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy PredictionZhen Yang, Yanpeng Dong, Jiayu Wang et al.
Multi-sensor fusion significantly enhances the accuracy and robustness of 3D semantic occupancy prediction, which is crucial for autonomous driving and robotics. However, most existing approaches depend on high-resolution images and complex networks to achieve top performance, hindering their deployment in practical scenarios. Moreover, current multi-sensor fusion approaches mainly focus on improving feature fusion while largely neglecting effective supervision strategies for those features. To address these issues, we propose DAOcc, a novel multi-modal occupancy prediction framework that leverages 3D object detection supervision to assist in achieving superior performance, while using a deployment-friendly image backbone and practical input resolution. In addition, we introduce a BEV View Range Extension strategy to mitigate performance degradation caused by lower image resolution. Extensive experiments demonstrate that DAOcc achieves new state-of-the-art results on both the Occ3D-nuScenes and Occ3D-Waymo benchmarks, and outperforms previous state-of-the-art methods by a significant margin using only a ResNet-50 backbone and 256*704 input resolution. With TensorRT optimization, DAOcc reaches 104.9 FPS while maintaining 54.2 mIoU on an NVIDIA RTX 4090 GPU. Code is available at https://github.com/AlphaPlusTT/DAOcc.
CLFeb 11
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active ParametersAilin Huang, Ang Li, Aobo Kong et al.
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
SDJun 1, 2022
Towards Generalisable Audio Representations for Audio-Visual NavigationShunqi Mao, Chaoyi Zhang, Heng Wang et al.
In audio-visual navigation (AVN), an intelligent agent needs to navigate to a constantly sound-making object in complex 3D environments based on its audio and visual perceptions. While existing methods attempt to improve the navigation performance with preciously designed path planning or intricate task settings, none has improved the model generalisation on unheard sounds with task settings unchanged. We thus propose a contrastive learning-based method to tackle this challenge by regularising the audio encoder, where the sound-agnostic goal-driven latent representations can be learnt from various audio signals of different classes. In addition, we consider two data augmentation strategies to enrich the training sounds. We demonstrate that our designs can be easily equipped to existing AVN frameworks to obtain an immediate performance gain (13.4%$\uparrow$ in SPL on Replica and 12.2%$\uparrow$ in SPL on MP3D). Our project is available at https://AV-GeN.github.io/.
CVOct 4, 2023
Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion ModelsJianglong Ye, Peng Wang, Kejie Li et al.
Zero-shot novel view synthesis (NVS) from a single image is an essential problem in 3D object understanding. While recent approaches that leverage pre-trained generative models can synthesize high-quality novel views from in-the-wild inputs, they still struggle to maintain 3D consistency across different views. In this paper, we present Consistent-1-to-3, which is a generative framework that significantly mitigates this issue. Specifically, we decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions. We design a scene representation transformer and view-conditioned diffusion model for performing these two stages respectively. Inside the models, to enforce 3D consistency, we propose to employ epipolor-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information. Finally, we design a hierarchy generation paradigm to generate long sequences of consistent views, allowing a full 360-degree observation of the provided object image. Qualitative and quantitative evaluation over multiple datasets demonstrates the effectiveness of the proposed mechanisms against state-of-the-art approaches. Our project page is at https://jianglongye.com/consistent123/
AISep 13, 2024
Explaining Datasets in Words: Statistical Models with Natural Language ParametersRuiqi Zhong, Heng Wang, Dan Klein et al.
To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate "discusses COVID". To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other, clustering math problems based on subareas, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, can be easily steered to focus on specific properties (e.g. subareas), and explains sophisticated concepts that classical methods (e.g. n-gram analysis) struggle to produce.
CVSep 21, 2024
Enhancing Advanced Visual Reasoning Ability of Large Language ModelsZhiyuan Li, Dongnan Liu, Chaoyi Zhang et al.
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models' advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks while struggling with complex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities; however, they lack visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs' visual perception proficiency and LLMs' extensive reasoning capability. Unlike recent multimodal large language models (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop and leverages LLMs' text knowledge for accurate predictions without extra training. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning. Additionally, we introduce Chain-of-Comparison (CoC), a step-by-step comparison technique enabling contrasting various aspects of predictions. Our CVR-LLM presents the first comprehensive study across a wide array of complex visual reasoning tasks and achieves SOTA performance among all.
ROMar 6Code
MagRobot:An Open Simulator for Magnetically Navigated RobotsHeng Wang, Haoyu Song, Jiatao Zheng et al.
Magnetic navigation systems, including magnetic tracking systems and magnetic actuation systems, have shown great potential for occlusion-free localization and remote control of intracorporeal medical devices and robots in minimally invasive medicine, such as capsule endoscopy and cardiovascular intervention. However, the design of magnetically navigated robots remains heavily reliant on experimental prototyping, which is time-consuming and costly. Furthermore, there is a lack of a consistent experimental environment to compare and benchmark the hardware and algorithms across different magnetic navigation systems. To address these challenges, we propose the first universal open-source simulation platform to facilitate research, design and benchmarking of magnetically navigated robots. Our simulator features an intuitive graphical user interface that enables the user to efficiently design, visualize, and analyze magnetic navigation systems for both rigid and soft robots. The proposed simulator is versatile, which can simulate both magnetic actuation and magnetic tracking tasks in diverse medical applications that involve deformable anatomies. The proposed simulator provides an open development environment, where the user can load third-party anatomical models and customize both hardware and algorithms of magnetic navigation systems. The fidelity of the simulator is validated using both phantom and ex vivo experiments of magnetic navigation of a continuum robot and a capsule robot with diverse magnetic actuation setups. Three use cases of the simulator, i.e., bronchoscopy, endovascular intervention, and gastrointestinal endoscopy, are implemented to demonstrate the functionality of the simulator. It is shown that the configuration and algorithms of magnetic navigation systems can be flexibly designed and optimized for better performance using the simulator.
CVApr 10, 2025Code
Kimi-VL Technical ReportKimi Team, Angang Du, Bohong Yin et al. · pku, tsinghua
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
CVApr 8, 2022
Canonical Mean Filter for Almost Zero-Shot Multi-Task classificationYong Li, Heng Wang, Xiang Ye
The support set is a key to providing conditional prior for fast adaption of the model in few-shot tasks. But the strict form of support set makes its construction actually difficult in practical application. Motivated by ANIL, we rethink the role of adaption in the feature extractor of CNAPs, which is a state-of-the-art representative few-shot method. To investigate the role, Almost Zero-Shot (AZS) task is designed by fixing the support set to replace the common scheme, which provides corresponding support sets for the different conditional prior of different tasks. The AZS experiment results infer that the adaptation works little in the feature extractor. However, CNAPs cannot be robust to randomly selected support sets and perform poorly on some datasets of Meta-Dataset because of its scattered mean embeddings responded by the simple mean operator. To enhance the robustness of CNAPs, Canonical Mean Filter (CMF) module is proposed to make the mean embeddings intensive and stable in feature space by mapping the support sets into a canonical form. CMFs make CNAPs robust to any fixed support sets even if they are random matrices. This attribution makes CNAPs be able to remove the mean encoder and the parameter adaptation network at the test stage, while CNAP-CMF on AZS tasks keeps the performance with one-shot tasks. It leads to a big parameter reduction. Precisely, 40.48\% parameters are dropped at the test stage. Also, CNAP-CMF outperforms CNAPs in one-shot tasks because it addresses inner-task unstable performance problems. Classification performance, visualized and clustering results verify that CMFs make CNAPs better and simpler.
CVFeb 14, 2025Code
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation ModelGuoqing Ma, Haoyang Huang, Kun Yan et al.
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
CLFeb 17, 2025Code
Step-Audio: Unified Understanding and Generation in Intelligent Speech InteractionAilin Huang, Boyong Wu, Bruce Wang et al.
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
CVNov 20, 2023
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language ModelsXiaotian Han, Quanzeng You, Yongfei Liu et al.
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. In instances where an MLLM is unable to produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps align with our manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessments, such as exams or assignments, and represents what we consider a more effective assessment technique compared with existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark, designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://infimm.github.io/InfiMM-Eval/
CLJan 29
MoCo: A One-Stop Shop for Model Collaboration ResearchShangbin Feng, Yuyang Bai, Ziyuan Yang et al.
Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library of executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users could flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.
LGFeb 26, 2025Code
Reward Shaping to Mitigate Reward Hacking in RLHFJiayi Fu, Xuandong Zhao, Chengyuan Yao et al. · berkeley
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to \emph{reward hacking}, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B, and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate of at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR, and the Work done during the internship at StepFun by Jiayi Fu.
CVSep 27, 2023
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data FilteringHaichao Yu, Yu Tian, Sateesh Kumar et al.
The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the computational budget, etc. We slice and dice our design choices, provide in-depth analysis, and discuss open questions. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
CLDec 23, 2025
Step-DeepResearch Technical ReportChen Hu, Haikuo Du, Heng Wang et al.
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
AIAug 12, 2025Code
OpenCUA: Open Foundations for Computer-Use AgentsXinyuan Wang, Bowen Wang, Dunjie Lu et al. · cmu
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
CVMar 15, 2025Code
ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving ObjectZhe Shan, Yang Liu, Lei Zhou et al.
The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. The ROS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM's generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we design the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while ROS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, ROS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm ROS-SAM as a powerful tool for fine-grained segmentation in remote sensing applications. Code is available at https://github.com/ShanZard/ROS-SAM.
CLMar 28
Learning to Predict Future-Aligned Research Proposals with Language ModelsHeng Wang, Pengcheng Jiang, Jiashuo Sun et al.
Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method.
CVSep 6, 2024
Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud SequencesRui Yu, Runkai Zhao, Cong Nie et al.
Accurate and robust LiDAR 3D object detection is essential for comprehensive scene understanding in autonomous driving. Despite its importance, LiDAR detection performance is limited by inherent constraints of point cloud data, particularly under conditions of extended distances and occlusions. Recently, temporal aggregation has been proven to significantly enhance detection accuracy by fusing multi-frame viewpoint information and enriching the spatial representation of objects. In this work, we introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information. We aim to improve the spatial-temporal interpretation capabilities of the LiDAR detector by incorporating a dynamic prior, generated from a non-learnable motion estimation model. Specifically, Motion-Guided Feature Aggregation (MGFA) is proposed to utilize the object trajectory from previous and future motion states to model spatial-temporal correlations into gaussian heatmap over a driving sequence. This motion-based heatmap then guides the temporal feature fusion, enriching the proposed object features. Moreover, we design a Dual Correlation Weighting Module (DCWM) that effectively facilitates the interaction between past and prospective frames through scene- and channel-wise feature abstraction. In the end, a cascade cross-attention-based decoder is employed to refine the 3D prediction. We have conducted experiments on the Waymo and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance with effective spatial-temporal feature learning.
CLDec 11, 2024Code
Fast Prompt Alignment for Text-to-Image GenerationKhalil Mrini, Hanlin Lu, Linjie Yang et al.
Text-to-image generation has advanced rapidly, yet aligning complex textual prompts with generated visuals remains challenging, especially with intricate object relationships and fine-grained details. This paper introduces Fast Prompt Alignment (FPA), a prompt optimization framework that leverages a one-pass approach, enhancing text-to-image alignment efficiency without the iterative overhead typical of current methods like OPT2I. FPA uses large language models (LLMs) for single-iteration prompt paraphrasing, followed by fine-tuning or in-context learning with optimized prompts to enable real-time inference, reducing computational demands while preserving alignment fidelity. Extensive evaluations on the COCO Captions and PartiPrompts datasets demonstrate that FPA achieves competitive text-image alignment scores at a fraction of the processing time, as validated through both automated metrics (TIFA, VQA) and human evaluation. A human study with expert annotators further reveals a strong correlation between human alignment judgments and automated scores, underscoring the robustness of FPA's improvements. The proposed method showcases a scalable, efficient alternative to iterative prompt optimization, enabling broader applicability in real-time, high-demand settings. The codebase is provided to facilitate further research: https://github.com/tiktok/fast_prompt_alignment
AIMar 8, 2024Code
MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-ExpertsZinan Zeng, Sen Ye, Zijian Cai et al.
Online movie review websites are valuable for information and discussion about movies. However, the massive spoiler reviews detract from the movie-watching experience, making spoiler detection an important task. Previous methods simply focus on reviews' text content, ignoring the heterogeneity of information in the platform. For instance, the metadata and the corresponding user's information of a review could be helpful. Besides, the spoiler language of movie reviews tends to be genre-specific, thus posing a domain generalization challenge for existing methods. To this end, we propose MMoE, a multi-modal network that utilizes information from multiple modalities to facilitate robust spoiler detection and adopts Mixture-of-Experts to enhance domain generalization. MMoE first extracts graph, text, and meta feature from the user-movie network, the review's textual content, and the review's metadata respectively. To handle genre-specific spoilers, we then adopt Mixture-of-Experts architecture to process information in three modalities to promote robustness. Finally, we use an expert fusion layer to integrate the features from different perspectives and make predictions based on the fused embedding. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56% and 8.41% in terms of accuracy and F1-score. Further experiments also demonstrate MMoE's superiority in robustness and generalization. Our code is available at https://github.com/zzqbjt/Spoiler-Detection.
CLApr 1, 2025Code
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic ResourcesWeizhi Wang, Yu Tian, Linjie Yang et al.
The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 220 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.