92.4CVJun 2
Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMsWei Ding, Yudong Zhang, Ruobing Xie et al.
As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
CVAug 7, 2023Code
Learning a Graph Neural Network with Cross Modality Interaction for Image FusionJiawei Li, Jiansheng Chen, Jinyuan Liu et al.
Infrared and visible image fusion has gradually proved to be a vital fork in the field of multi-modality imaging technologies. In recent developments, researchers not only focus on the quality of fused images but also evaluate their performance in downstream tasks. Nevertheless, the majority of methods seldom put their eyes on the mutual learning from different modalities, resulting in fused images lacking significant details and textures. To overcome this issue, we propose an interactive graph neural network (GNN)-based architecture between cross modality for fusion, called IGNet. Specifically, we first apply a multi-scale extractor to achieve shallow features, which are employed as the necessary input to build graph structures. Then, the graph interaction module can construct the extracted intermediate features of the infrared/visible branch into graph structures. Meanwhile, the graph structures of two branches interact for cross-modality and semantic learning, so that fused images can maintain the important feature expressions and enhance the performance of downstream tasks. Besides, the proposed leader nodes can improve information propagation in the same modality. Finally, we merge all graph features to get the fusion result. Extensive experiments on different datasets (TNO, MFNet and M3FD) demonstrate that our IGNet can generate visually appealing fused images while scoring averagely 2.59% mAP@.5 and 7.77% mIoU higher in detection and segmentation than the compared state-of-the-art methods. The source code of the proposed IGNet can be available at https://github.com/lok-18/IGNet.
CVSep 8, 2024Code
PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe QuestionsYudong Zhang, Ruobing Xie, Jiansheng Chen et al. · tsinghua
Large Vision-Language Models (LVLMs) have demonstrated their powerful multimodal capabilities. However, they also face serious safety problems, as adversaries can induce robustness issues in LVLMs through the use of well-designed adversarial examples. Therefore, LVLMs are in urgent need of detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns for clean images when presented with probe questions. We propose an unconventional method named PIP, which utilizes the attention patterns of one randomly selected irrelevant probe question (e.g., "Is there a clock?") to distinguish adversarial examples from clean examples. Regardless of the image to be tested and its corresponding question, PIP only needs to perform one additional inference of the image to be tested and the probe question, and then achieves successful detection of adversarial examples. Even under black-box attacks and open dataset scenarios, our PIP, coupled with a simple SVM, still achieves more than 98% recall and a precision of over 90%. Our PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding and introspection within LVLMs. The code is available at https://github.com/btzyd/pip.
CVJul 16, 2024Code
UrbanWorld: An Urban World Model for 3D City GenerationYu Shang, Yuming Lin, Yu Zheng et al.
Cities, as the essential environment of human life, encompass diverse physical elements such as buildings, roads and vegetation, which continuously interact with dynamic entities like people and vehicles. Crafting realistic, interactive 3D urban environments is essential for nurturing AGI systems and constructing AI agents capable of perceiving, decision-making, and acting like humans in real-world environments. However, creating high-fidelity 3D urban environments usually entails extensive manual labor from designers, involving intricate detailing and representation of complex urban elements. Therefore, accomplishing this automatically remains a longstanding challenge. Toward this problem, we propose UrbanWorld, the first generative urban world model that can automatically create a customized, realistic and interactive 3D urban world with flexible control conditions. UrbanWorld incorporates four key stages in the generation pipeline: flexible 3D layout generation from OSM data or urban layout with semantic and height maps, urban scene design with Urban MLLM, controllable urban asset rendering via progressive 3D diffusion, and MLLM-assisted scene refinement. We conduct extensive quantitative analysis on five visual metrics, demonstrating that UrbanWorld achieves SOTA generation realism. Next, we provide qualitative results about the controllable generation capabilities of UrbanWorld using both textual and image-based prompts. Lastly, we verify the interactive nature of these environments by showcasing the agent perception and navigation within the created environments. We contribute UrbanWorld as an open-source tool available at https://github.com/Urban-World/UrbanWorld.
CVNov 3, 2025Code
MVSMamba: Multi-View Stereo with State Space ModelJianfei Jiang, Qiankun Liu, Hongyuan Liu et al.
Robust feature representations are essential for learning-based Multi-View Stereo (MVS), which relies on accurate feature matching. Recent MVS methods leverage Transformers to capture long-range dependencies based on local features extracted by conventional feature pyramid networks. However, the quadratic complexity of Transformer-based MVS methods poses challenges to balance performance and efficiency. Motivated by the global modeling capability and linear complexity of the Mamba architecture, we propose MVSMamba, the first Mamba-based MVS network. MVSMamba enables efficient global feature aggregation with minimal computational overhead. To fully exploit Mamba's potential in MVS, we propose a Dynamic Mamba module (DM-module) based on a novel reference-centered dynamic scanning strategy, which enables: (1) Efficient intra- and inter-view feature interaction from the reference to source views, (2) Omnidirectional multi-view feature representations, and (3) Multi-scale global feature aggregation. Extensive experimental results demonstrate MVSMamba outperforms state-of-the-art MVS methods on the DTU dataset and the Tanks-and-Temples benchmark with both superior performance and efficiency. The source code is available at https://github.com/JianfeiJ/MVSMamba.
CVFeb 8, 2023Code
Gestalt-Guided Image Understanding for Few-Shot LearningKun Song, Yuchen Wu, Jiansheng Chen et al.
Due to the scarcity of available data, deep learning does not perform well on few-shot learning tasks. However, human can quickly learn the feature of a new category from very few samples. Nevertheless, previous work has rarely considered how to mimic human cognitive behavior and apply it to few-shot learning. This paper introduces Gestalt psychology to few-shot learning and proposes Gestalt-Guided Image Understanding, a plug-and-play method called GGIU. Referring to the principle of totality and the law of closure in Gestalt psychology, we design Totality-Guided Image Understanding and Closure-Guided Image Understanding to extract image features. After that, a feature estimation module is used to estimate the accurate features of images. Extensive experiments demonstrate that our method can improve the performance of existing models effectively and flexibly without retraining or fine-tuning. Our code is released on https://github.com/skingorz/GGIU.
CVNov 7, 2022
Few-shot Image Generation with Diffusion ModelsJingyuan Zhu, Huimin Ma, Jiansheng Chen et al.
Denoising diffusion probabilistic models (DDPMs) have been proven capable of synthesizing high-quality images with remarkable diversity when trained on large amounts of data. However, to our knowledge, few-shot image generation tasks have yet to be studied with DDPM-based approaches. Modern approaches are mainly built on Generative Adversarial Networks (GANs) and adapt models pre-trained on large source domains to target domains using a few available samples. In this paper, we make the first attempt to study when do DDPMs overfit and suffer severe diversity degradation as training data become scarce. Then we fine-tune DDPMs pre-trained on large source domains to solve the overfitting problem when training data is limited. Although the directly fine-tuned models accelerate convergence and improve generation quality and diversity compared with training from scratch, they still fail to retain some diverse features and can only produce coarse images. Therefore, we design a DDPM pairwise adaptation (DDPM-PA) approach to optimize few-shot DDPM domain adaptation. DDPM-PA efficiently preserves information learned from source domains by keeping the relative pairwise distances between generated samples during adaptation. Besides, DDPM-PA enhances the learning of high-frequency details from source models and limited training data. DDPM-PA further improves generation quality and diversity and achieves results better than current state-of-the-art GAN-based approaches. We demonstrate the effectiveness of our approach on a series of few-shot image generation tasks qualitatively and quantitatively.
CVOct 27, 2022
Few-shot Image Generation via Masked DiscriminationJingyuan Zhu, Huimin Ma, Jiansheng Chen et al.
Few-shot image generation aims to generate images of high quality and great diversity with limited data. However, it is difficult for modern GANs to avoid overfitting when trained on only a few images. The discriminator can easily remember all the training samples and guide the generator to replicate them, leading to severe diversity degradation. Several methods have been proposed to relieve overfitting by adapting GANs pre-trained on large source domains to target domains using limited real samples. This work presents a novel approach to realize few-shot GAN adaptation via masked discrimination. Random masks are applied to features extracted by the discriminator from input images. We aim to encourage the discriminator to judge various images which share partially common features with training samples as realistic. Correspondingly, the generator is guided to generate diverse images instead of replicating training samples. In addition, we employ a cross-domain consistency loss for the discriminator to keep relative distances between generated samples in its feature space. It strengthens global image discrimination and guides adapted GANs to preserve more information learned from source domains for higher image quality. The effectiveness of our approach is demonstrated both qualitatively and quantitatively with higher quality and greater diversity on a series of few-shot image generation tasks than prior methods.
CVJun 25, 2023
DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited DataJingyuan Zhu, Huimin Ma, Jiansheng Chen et al.
Denoising diffusion probabilistic models (DDPMs) have been proven capable of synthesizing high-quality images with remarkable diversity when trained on large amounts of data. Typical diffusion models and modern large-scale conditional generative models like text-to-image generative models are vulnerable to overfitting when fine-tuned on extremely limited data. Existing works have explored subject-driven generation using a reference set containing a few images. However, few prior works explore DDPM-based domain-driven generation, which aims to learn the common features of target domains while maintaining diversity. This paper proposes a novel DomainStudio approach to adapt DDPMs pre-trained on large-scale source datasets to target domains using limited data. It is designed to keep the diversity of subjects provided by source domains and get high-quality and diverse adapted samples in target domains. We propose to keep the relative distances between adapted samples to achieve considerable generation diversity. In addition, we further enhance the learning of high-frequency details for better generation quality. Our approach is compatible with both unconditional and conditional diffusion models. This work makes the first attempt to realize unconditional few-shot image generation with diffusion models, achieving better quality and greater diversity than current state-of-the-art GAN-based approaches. Moreover, this work also significantly relieves overfitting for conditional generation and realizes high-quality domain-driven generation, further expanding the applicable scenarios of modern large-scale text-to-image models.
CVFeb 14, 2025Code
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation ModelGuoqing Ma, Haoyang Huang, Kun Yan et al.
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
85.8ROMay 23
PACT: Proactive Asking for Continual Task Assistance in Human-Robot CollaborationChengbo He, Sheng Li, Chenyang Ma et al.
Robotic assistants in long-term human-robot collaboration need to assist users under partial observations while leveraging cross-day interaction history. However, human traits and routines are often unknown at the beginning of collaboration, making passive infer-then-act assistance ineffective and inefficient. To address this challenge, we study a cross-day proactive asking setting for continual task assistance and propose PACT (Proactive Asking for Continual Task Assistance), an ask-or-act framework that determines whether clarification should be sought before taking action. PACT leverages current observations together with accumulated interaction history to evaluate contextual sufficiency, enabling the robot to provide more reliable assistance and progressively adapt to the user over time. We implement its primary learned instantiation using reinforcement learning and evaluate alternative instantiations under the same framework. To assess such behavior, we further introduce a clarification utility metric that quantifies the trade-off between assistance accuracy and the frequency of clarification requests. Experiments in multi-day embodied collaboration scenarios demonstrate that, compared with passive inference baselines, PACT consistently improves both assistance accuracy and clarification utility, highlighting the importance of proactive asking in continual human-robot collaboration.
CVMar 6, 2023
MotionVideoGAN: A Novel Video Generator Based on the Motion Space Learned from Image PairsJingyuan Zhu, Huimin Ma, Jiansheng Chen et al.
Video generation has achieved rapid progress benefiting from high-quality renderings provided by powerful image generators. We regard the video synthesis task as generating a sequence of images sharing the same contents but varying in motions. However, most previous video synthesis frameworks based on pre-trained image generators treat content and motion generation separately, leading to unrealistic generated videos. Therefore, we design a novel framework to build the motion space, aiming to achieve content consistency and fast convergence for video generation. We present MotionVideoGAN, a novel video generator synthesizing videos based on the motion space learned by pre-trained image pair generators. Firstly, we propose an image pair generator named MotionStyleGAN to generate image pairs sharing the same contents and producing various motions. Then we manage to acquire motion codes to edit one image in the generated image pairs and keep the other unchanged. The motion codes help us edit images within the motion space since the edited image shares the same contents with the other unchanged one in image pairs. Finally, we introduce a latent code generator to produce latent code sequences using motion codes for video generation. Our approach achieves state-of-the-art performance on the most complex video dataset ever used for unconditional video generation evaluation, UCF101.
CLFeb 17, 2025Code
Step-Audio: Unified Understanding and Generation in Intelligent Speech InteractionAilin Huang, Boyong Wu, Bruce Wang et al.
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
LGSep 25, 2024
Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised TrainingKun Song, Zhiquan Tan, Bochao Zou et al.
In this paper, we introduce matrix entropy as an analytical tool for studying supervised learning, investigating the information content of data representations and classification head vectors, as well as the dynamic interactions between them during the supervised learning process. Our experimental results reveal that matrix entropy effectively captures the variations in information content of data representations and classification head vectors as neural networks approach Neural Collapse during supervised training, while also serving as a robust metric for measuring similarity among data samples. Leveraging this property, we propose Cross-Model Alignment (CMA) loss to optimize the fine-tuning of pretrained models. To characterize the dynamics of neural networks nearing the Neural Collapse state, we introduce two novel metrics: the Matrix Mutual Information Ratio (MIR) and the Matrix Entropy Difference Ratio (HDR), which quantitatively assess the interactions between data representations and classification heads in supervised learning, with theoretical optimal values derived under the Neural Collapse state. Our experiments demonstrate that MIR and HDR effectively explain various phenomena in neural networks, including the dynamics of standard supervised training, linear mode connectivity. Moreover, we use MIR and HDR to analyze the dynamics of grokking, which is a fascinating phenomenon in supervised learning where a model unexpectedly exhibits generalization long after achieving training data fit.
CLJul 22, 2025Code
Step-Audio 2 Technical ReportBoyong Wu, Chao Yan, Chen Hu et al.
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
CVJan 17, 2023
Distribution Aligned Feature Clustering for Zero-Shot Sketch-Based Image RetrievalYuchen Wu, Kun Song, Fangzheng Zhao et al.
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a challenging cross-modal retrieval task. In prior arts, the retrieval is conducted by sorting the distance between the query sketch and each image in the gallery. However, the domain gap and the zero-shot setting make neural networks hard to generalize. This paper tackles the challenges from a new perspective: utilizing gallery image features. We propose a Cluster-then-Retrieve (ClusterRetri) method that performs clustering on the gallery images and uses the cluster centroids as proxies for retrieval. Furthermore, a distribution alignment loss is proposed to align the image and sketch features with a common Gaussian distribution, reducing the domain gap. Despite its simplicity, our proposed method outperforms the state-of-the-art methods by a large margin on popular datasets, e.g., up to 31% and 39% relative improvement of mAP@all on the Sketchy and TU-Berlin datasets.
CLDec 23, 2025
Step-DeepResearch Technical ReportChen Hu, Haikuo Du, Heng Wang et al.
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
CVFeb 20, 2024Code
RhythmFormer: Extracting Patterned rPPG Signals based on Periodic Sparse AttentionBochao Zou, Zizheng Guo, Jiansheng Chen et al.
Remote photoplethysmography (rPPG) is a non-contact method for detecting physiological signals based on facial videos, holding high potential in various applications. Due to the periodicity nature of rPPG signals, the long-range dependency capturing capacity of the transformer was assumed to be advantageous for such signals. However, existing methods have not conclusively demonstrated the superior performance of transformers over traditional convolutional neural networks. This may be attributed to the quadratic scaling exhibited by transformer with sequence length, resulting in coarse-grained feature extraction, which in turn affects robustness and generalization. To address that, this paper proposes a periodic sparse attention mechanism based on temporal attention sparsity induced by periodicity. A pre-attention stage is introduced before the conventional attention mechanism. This stage learns periodic patterns to filter out a large number of irrelevant attention computations, thus enabling fine-grained feature extraction. Moreover, to address the issue of fine-grained features being more susceptible to noise interference, a fusion stem is proposed to effectively guide self-attention towards rPPG features. It can be easily integrated into existing methods to enhance their performance. Extensive experiments show that the proposed method achieves state-of-the-art performance in both intra-dataset and cross-dataset evaluations. The codes are available at https://github.com/zizheng-guo/RhythmFormer.
CVMar 14, 2025Code
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation ModelHaoyang Huang, Guoqing Ma, Nan Duan et al.
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
51.6CVMar 25
Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language ModelsSiqi Liu, Xinyang Li, Bochao Zou et al.
As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention through different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark-an egocentric, real-world video dataset for ToM with three multiple-choice QA settings-demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing machine-human collaboration toward greater alignment.
43.2CVMar 26
Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation ModelingShiji Zhao, Shukun Xiong, Maoxun Yuan et al.
In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven ideology, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.
CVJun 19, 2025Code
AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving ScenariosYunhao Hou, Bochao Zou, Min Zhang et al.
By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. While most previous work focus on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, with limited attention to aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 80K LiDAR frames and 360K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 17% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 350 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC-Drive.
CVApr 15, 2025Code
QAVA: Query-Agnostic Visual Attack to Large Vision-Language ModelsYudong Zhang, Ruobing Xie, Jiansheng Chen et al. · tsinghua
In typical multimodal tasks, such as Visual Question Answering (VQA), adversarial attacks targeting a specific image and question can lead large vision-language models (LVLMs) to provide incorrect answers. However, it is common for a single image to be associated with multiple questions, and LVLMs may still answer other questions correctly even for an adversarial image attacked by a specific question. To address this, we introduce the query-agnostic visual attack (QAVA), which aims to create robust adversarial examples that generate incorrect responses to unspecified and unknown questions. Compared to traditional adversarial attacks focused on specific images and questions, QAVA significantly enhances the effectiveness and efficiency of attacks on images when the question is unknown, achieving performance comparable to attacks on known target questions. Our research broadens the scope of visual adversarial attacks on LVLMs in practical settings, uncovering previously overlooked vulnerabilities, particularly in the context of visual adversarial threats. The code is available at https://github.com/btzyd/qava.
39.3CVMay 14
MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMsWei Ding, Yilin Li, Yudong Zhang et al.
Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.
CVNov 27, 2024Code
DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language ModelsYudong Zhang, Ruobing Xie, Xingwu Sun et al. · tsinghua
Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. However, they continue to suffer from significant hallucination issues, including object, attribute, and relational hallucinations. To accurately detect these hallucinations, we investigated the variations in cross-modal attention patterns between hallucination and non-hallucination states. Leveraging these distinctions, we developed a lightweight detector capable of identifying hallucinations. Our proposed method, Detecting Hallucinations by Cross-modal Attention Patterns (DHCP), is straightforward and does not require additional LVLM training or extra LVLM inference steps. Experimental results show that DHCP achieves remarkable performance in hallucination detection. By offering novel insights into the identification and analysis of hallucinations in LVLMs, DHCP contributes to advancing the reliability and trustworthiness of these models. The code is available at https://github.com/btzyd/DHCP.
CVJul 15, 2025Code
MonoMVSNet: Monocular Priors Guided Multi-View Stereo NetworkJianfei Jiang, Qiankun Liu, Haochen Yu et al.
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.
CVJun 1, 2025Code
Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMsYudong Zhang, Ruobing Xie, Yiqing Huang et al. · tsinghua
Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive ``fighting fire with fire'' strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversary examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free and straightforward to implement, and exhibits significant computational efficiency improvements compared to existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code is available at https://github.com/btzyd/F3.
CVMar 25, 2025Code
DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for AlignmentWeizhi Chen, Yupeng Deng, Jin Wei et al.
Vision Language Foundation Models based on CLIP architecture for remote sensing primarily rely on short text captions, which often result in incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short text and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset, where each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Based on this, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks: long text cross-modal retrieval, short text cross-modal retrieval, image classification, and semantic localization demonstrate that DGTRS-CLIP consistently outperforms existing methods across all tasks. The code has been open-sourced and is available at https://github.com/MitsuiChen14/DGTRS.
CVDec 13, 2024Code
$\textrm{A}^{\textrm{2}}$RNet: Adversarial Attack Resilient Network for Robust Infrared and Visible Image FusionJiawei Li, Hongwei Yu, Jiansheng Chen et al.
Infrared and visible image fusion (IVIF) is a crucial technique for enhancing visual performance by integrating unique information from different modalities into one fused image. Exiting methods pay more attention to conducting fusion with undisturbed data, while overlooking the impact of deliberate interference on the effectiveness of fusion results. To investigate the robustness of fusion models, in this paper, we propose a novel adversarial attack resilient network, called $\textrm{A}^{\textrm{2}}$RNet. Specifically, we develop an adversarial paradigm with an anti-attack loss function to implement adversarial attacks and training. It is constructed based on the intrinsic nature of IVIF and provide a robust foundation for future research advancements. We adopt a Unet as the pipeline with a transformer-based defensive refinement module (DRM) under this paradigm, which guarantees fused image quality in a robust coarse-to-fine manner. Compared to previous works, our method mitigates the adverse effects of adversarial perturbations, consistently maintaining high-fidelity fusion results. Furthermore, the performance of downstream tasks can also be well maintained under adversarial attacks. Code is available at https://github.com/lok-18/A2RNet.
LGMar 6, 2024Code
Advancing Out-of-Distribution Detection through Data Purification and Dynamic Activation Function DesignYingrui Ji, Yao Zhu, Zhigang Li et al.
In the dynamic realms of machine learning and deep learning, the robustness and reliability of models are paramount, especially in critical real-world applications. A fundamental challenge in this sphere is managing Out-of-Distribution (OOD) samples, significantly increasing the risks of model misclassification and uncertainty. Our work addresses this challenge by enhancing the detection and management of OOD samples in neural networks. We introduce OOD-R (Out-of-Distribution-Rectified), a meticulously curated collection of open-source datasets with enhanced noise reduction properties. In-Distribution (ID) noise in existing OOD datasets can lead to inaccurate evaluation of detection algorithms. Recognizing this, OOD-R incorporates noise filtering technologies to refine the datasets, ensuring a more accurate and reliable evaluation of OOD detection algorithms. This approach not only improves the overall quality of data but also aids in better distinguishing between OOD and ID samples, resulting in up to a 2.5\% improvement in model accuracy and a minimum 3.2\% reduction in false positives. Furthermore, we present ActFun, an innovative method that fine-tunes the model's response to diverse inputs, thereby improving the stability of feature extraction and minimizing specificity issues. ActFun addresses the common problem of model overconfidence in OOD detection by strategically reducing the influence of hidden units, which enhances the model's capability to estimate OOD uncertainty more accurately. Implementing ActFun in the OOD-R dataset has led to significant performance enhancements, including an 18.42\% increase in AUROC of the GradNorm method and a 16.93\% decrease in FPR95 of the Energy method. Overall, our research not only advances the methodologies in OOD detection but also emphasizes the importance of dataset integrity for accurate algorithm evaluation.
CVAug 22, 2025Code
IRSAMap:Towards Large-Scale, High-Resolution Land Cover Map VectorizationYu Meng, Ligao Deng, Zhihao Xi et al.
With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at https://github.com/ucas-dlg/IRSAMap
CRMay 31, 2025Code
The Security Threat of Compressed Projectors in Large Vision-Language ModelsYudong Zhang, Ruobing Xie, Xingwu Sun et al. · tsinghua
The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, and each offers distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of structure information. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code is available at https://github.com/btzyd/TCP.
CVSep 30, 2020Code
Teacher-Critical Training Strategies for Image CaptioningYiqing Huang, Jiansheng Chen
Existing image captioning models are usually trained by cross-entropy (XE) loss and reinforcement learning (RL), which set ground-truth words as hard targets and force the captioning model to learn from them. However, the widely adopted training strategies suffer from misalignment in XE training and inappropriate reward assignment in RL training. To tackle these problems, we introduce a teacher model that serves as a bridge between the ground-truth caption and the caption model by generating some easier-to-learn word proposals as soft targets. The teacher model is constructed by incorporating the ground-truth image attributes into the baseline caption model. To effectively learn from the teacher model, we propose Teacher-Critical Training Strategies (TCTS) for both XE and RL training to facilitate better learning processes for the caption model. Experimental evaluations of several widely adopted caption models on the benchmark MSCOCO dataset show the proposed TCTS comprehensively enhances most evaluation metrics, especially the Bleu and Rouge-L scores, in both training stages. TCTS is able to achieve to-date the best published single model Bleu-4 and Rouge-L performances of 40.2% and 59.4% on the MSCOCO Karpathy test split. Our codes and pre-trained models will be open-sourced.
CLDec 31, 2024
Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agentsChengbo He, Bochao Zou, Xin Li et al.
Agents have demonstrated their potential in scientific reasoning tasks through large language models. However, they often face challenges such as insufficient accuracy and degeneration of thought when handling complex reasoning tasks, which impede their performance. To overcome these issues, we propose the Reactive and Reflection agents with Multi-Path Reasoning (RR-MP) Framework, aimed at enhancing the reasoning capabilities of LLMs. Our approach improves scientific reasoning accuracy by employing a multi-path reasoning mechanism where each path consists of a reactive agent and a reflection agent that collaborate to prevent degeneration of thought inherent in single-agent reliance. Additionally, the RR-MP framework does not require additional training; it utilizes multiple dialogue instances for each reasoning path and a separate summarizer to consolidate insights from all paths. This design integrates diverse perspectives and strengthens reasoning across each path. We conducted zero-shot and few-shot evaluations on tasks involving moral scenarios, college-level physics, and mathematics. Experimental results demonstrate that our method outperforms baseline approaches, highlighting the effectiveness and advantages of the RR-MP framework in managing complex scientific reasoning tasks.
SDJun 10, 2025
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language ModelAilin Huang, Bingxin Li, Bruce Wang et al.
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoder in enhancing overall performance for AQAA tasks.
76.6CVApr 21
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World ImagesHongyuan Liu, Bochao Zou, Qiankun Liu et al.
Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.
CVMar 25, 2024
Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion GuidanceJingyuan Zhu, Huimin Ma, Jiansheng Chen et al.
Large-scale text-to-image diffusion models have achieved great success in synthesizing high-quality and diverse images given target text prompts. Despite the revolutionary image generation ability, current state-of-the-art models still struggle to deal with multi-concept generation accurately in many cases. This phenomenon is known as ``concept bleeding" and displays as the unexpected overlapping or merging of various concepts. This paper presents a general approach for text-to-image diffusion models to address the mutual interference between different subjects and their attachments in complex scenes, pursuing better text-image consistency. The core idea is to isolate the synthesizing processes of different concepts. We propose to bind each attachment to corresponding subjects separately with split text prompts. Besides, we introduce a revision method to fix the concept bleeding problem in multi-subject synthesis. We first depend on pre-trained object detection and segmentation models to obtain the layouts of subjects. Then we isolate and resynthesize each subject individually with corresponding text prompts to avoid mutual interference. Overall, we achieve a training-free strategy, named Isolated Diffusion, to optimize multi-concept text-to-image synthesis. It is compatible with the latest Stable Diffusion XL (SDXL) and prior Stable Diffusion (SD) models. We compare our approach with alternative methods using a variety of multi-concept text prompts and demonstrate its effectiveness with clear advantages in text-image consistency and user study.
CVMar 31, 2025
CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP GeneralizationYingrui Ji, Xi Xiao, Gaofei Chen et al.
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP's strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP's contrastive learning objective as an implicit Information Bottleneck optimization. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these IB principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing semantic alignment between image and text features. We validate CIBR on extensive vision-language benchmarks, including zero-shot classification across seven diverse image datasets and text-image retrieval on MSCOCO and Flickr30K. The results show consistent performance gains over standard CLIP. These findings provide the first theoretical understanding of CLIP's generalization through the IB lens. They also demonstrate practical improvements, offering guidance for future cross-modal representation learning.
CVApr 7, 2024
Strictly-ID-Preserved and Controllable Accessory Advertising Image GenerationYouze Xue, Binghui Chen, Yifeng Geng et al.
Customized generative text-to-image models have the ability to produce images that closely resemble a given subject. However, in the context of generating advertising images for e-commerce scenarios, it is crucial that the generated subject's identity aligns perfectly with the product being advertised. In order to address the need for strictly-ID preserved advertising image generation, we have developed a Control-Net based customized image generation pipeline and have taken earring model advertising as an example. Our approach facilitates a seamless interaction between the earrings and the model's face, while ensuring that the identity of the earrings remains intact. Furthermore, to achieve a diverse and controllable display, we have proposed a multi-branch cross-attention architecture, which allows for control over the scale, pose, and appearance of the model, going beyond the limitations of text prompts. Our method manages to achieve fine-grained control of the generated model's face, resulting in controllable and captivating advertising effects.
CVDec 21, 2024
Enhancing Contrastive Learning Inspired by the Philosophy of "The Blind Men and the Elephant"Yudong Zhang, Ruobing Xie, Jiansheng Chen et al. · tsinghua
Contrastive learning is a prevalent technique in self-supervised vision representation learning, typically generating positive pairs by applying two data augmentations to the same image. Designing effective data augmentation strategies is crucial for the success of contrastive learning. Inspired by the story of the blind men and the elephant, we introduce JointCrop and JointBlur. These methods generate more challenging positive pairs by leveraging the joint distribution of the two augmentation parameters, thereby enabling contrastive learning to acquire more effective feature representations. To the best of our knowledge, this is the first effort to explicitly incorporate the joint distribution of two data augmentation parameters into contrastive learning. As a plug-and-play framework without additional computational overhead, JointCrop and JointBlur enhance the performance of SimCLR, BYOL, MoCo v1, MoCo v2, MoCo v3, SimSiam, and Dino baselines with notable improvements.
CVOct 9, 2025
XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting MethodHaochen Yu, Qiankun Liu, Hongyuan Liu et al.
Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited while reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the $360^\circ$ panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose \textbf{XYZCylinder}, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner. Project page: \href{https://yuyuyu223.github.io/XYZCYlinder-projectpage/}{here}.
CVAug 16, 2025
InstDrive: Instance-Aware 3D Gaussian Splatting for Driving ScenesHongyuan Liu, Haochen Yu, Bochao Zou et al.
Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance-level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre-processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scene. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre-processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open-world driving scenes.More visualizations are available at our project page.
CVJul 14, 2025
Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry TexturesXinlong Ding, Hongwei Yu, Jiawei Li et al.
Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In the object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to significant enhancement in the attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
AIJun 17, 2025
From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language ModelsXinyang Li, Siqi Liu, Bochao Zou et al.
As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model's exhibited ToM by adjusting in the direction of the attention head.
AIMay 25, 2025
DeCoDe: Defer-and-Complement Decision-Making via Decoupled Concept Bottleneck ModelsChengbo He, Bochao Zou, Junliang Xing et al.
In human-AI collaboration, a central challenge is deciding whether the AI should handle a task, be deferred to a human expert, or be addressed through collaborative effort. Existing Learning to Defer approaches typically make binary choices between AI and humans, neglecting their complementary strengths. They also lack interpretability, a critical property in high-stakes scenarios where users must understand and, if necessary, correct the model's reasoning. To overcome these limitations, we propose Defer-and-Complement Decision-Making via Decoupled Concept Bottleneck Models (DeCoDe), a concept-driven framework for human-AI collaboration. DeCoDe makes strategy decisions based on human-interpretable concept representations, enhancing transparency throughout the decision process. It supports three flexible modes: autonomous AI prediction, deferral to humans, and human-AI collaborative complementarity, selected via a gating network that takes concept-level inputs and is trained using a novel surrogate loss that balances accuracy and human effort. This approach enables instance-specific, interpretable, and adaptive human-AI collaboration. Experiments on real-world datasets demonstrate that DeCoDe significantly outperforms AI-only, human-only, and traditional deferral baselines, while maintaining strong robustness and interpretability even under noisy expert annotations.
LGFeb 15, 2025
E2CB2former: Effecitve and Explainable Transformer for CB2 Receptor Ligand Activity PredictionJiacheng Xie, Yingrui Ji, Linghuan Zeng et al.
Accurate prediction of CB2 receptor ligand activity is pivotal for advancing drug discovery targeting this receptor, which is implicated in inflammation, pain management, and neurodegenerative conditions. Although conventional machine learning and deep learning techniques have shown promise, their limited interpretability remains a significant barrier to rational drug design. In this work, we introduce CB2former, a framework that combines a Graph Convolutional Network with a Transformer architecture to predict CB2 receptor ligand activity. By leveraging the Transformer's self attention mechanism alongside the GCN's structural learning capability, CB2former not only enhances predictive performance but also offers insights into the molecular features underlying receptor activity. We benchmark CB2former against diverse baseline models including Random Forest, Support Vector Machine, K Nearest Neighbors, Gradient Boosting, Extreme Gradient Boosting, Multilayer Perceptron, Convolutional Neural Network, and Recurrent Neural Network and demonstrate its superior performance with an R squared of 0.685, an RMSE of 0.675, and an AUC of 0.940. Moreover, attention weight analysis reveals key molecular substructures influencing CB2 receptor activity, underscoring the model's potential as an interpretable AI tool for drug discovery. This ability to pinpoint critical molecular motifs can streamline virtual screening, guide lead optimization, and expedite therapeutic development. Overall, our results showcase the transformative potential of advanced AI approaches exemplified by CB2former in delivering both accurate predictions and actionable molecular insights, thus fostering interdisciplinary collaboration and innovation in drug discovery.
CVMay 19, 2023
Few-shot 3D Shape GenerationJingyuan Zhu, Huimin Ma, Jiansheng Chen et al.
Realistic and diverse 3D shape generation is helpful for a wide variety of applications such as virtual reality, gaming, and animation. Modern generative models, such as GANs and diffusion models, learn from large-scale datasets and generate new samples following similar data distributions. However, when training data is limited, deep neural generative networks overfit and tend to replicate training samples. Prior works focus on few-shot image generation to produce high-quality and diverse results using a few target images. Unfortunately, abundant 3D shape data is typically hard to obtain as well. In this work, we make the first attempt to realize few-shot 3D shape generation by adapting generative models pre-trained on large source domains to target domains using limited data. To relieve overfitting and keep considerable diversity, we propose to maintain the probability distributions of the pairwise relative distances between adapted samples at feature-level and shape-level during domain adaptation. Our approach only needs the silhouettes of few-shot target samples as training data to learn target geometry distributions and achieve generated shapes with diverse topology and textures. Moreover, we introduce several metrics to evaluate the quality and diversity of few-shot 3D shape generation. The effectiveness of our approach is demonstrated qualitatively and quantitatively under a series of few-shot 3D shape adaptation setups.
CVNov 18, 2020
Shaping Deep Feature Space towards Gaussian Mixture for Visual ClassificationWeitao Wan, Jiansheng Chen, Cheng Yu et al.
The softmax cross-entropy loss function has been widely used to train deep models for various tasks. In this work, we propose a Gaussian mixture (GM) loss function for deep neural networks for visual classification. Unlike the softmax cross-entropy loss, our method explicitly shapes the deep feature space towards a Gaussian Mixture distribution. With a classification margin and a likelihood regularization, the GM loss facilitates both high classification performance and accurate modeling of the feature distribution. The GM loss can be readily used to distinguish abnormal inputs, such as the adversarial examples, based on the discrepancy between feature distributions of the inputs and the training set. Furthermore, theoretical analysis shows that a symmetric feature space can be achieved by using the GM loss, which enables the models to perform robustly against adversarial attacks. The proposed model can be implemented easily and efficiently without using extra trainable parameters. Extensive evaluations demonstrate that the proposed method performs favorably not only on image classification but also on robust detection of adversarial examples generated by strong attacks under different threat models.
CVMar 8, 2018
Rethinking Feature Distribution for Loss Functions in Image ClassificationWeitao Wan, Yuanyi Zhong, Tianpeng Li et al.
We propose a large-margin Gaussian Mixture (L-GM) loss for deep neural networks in classification tasks. Different from the softmax cross-entropy loss, our proposal is established on the assumption that the deep features of the training set follow a Gaussian Mixture distribution. By involving a classification margin and a likelihood regularization, the L-GM loss facilitates both a high classification performance and an accurate modeling of the training feature distribution. As such, the L-GM loss is superior to the softmax loss and its major variants in the sense that besides classification, it can be readily used to distinguish abnormal inputs, such as the adversarial examples, based on their features' likelihood to the training feature distribution. Extensive experiments on various recognition benchmarks like MNIST, CIFAR, ImageNet and LFW, as well as on adversarial examples demonstrate the effectiveness of our proposal.
CVJan 25, 2017
Towards End-to-End Face Recognition through Alignment LearningYuanyi Zhong, Jiansheng Chen, Bo Huang
Plenty of effective methods have been proposed for face recognition during the past decade. Although these methods differ essentially in many aspects, a common practice of them is to specifically align the facial area based on the prior knowledge of human face structure before feature extraction. In most systems, the face alignment module is implemented independently. This has actually caused difficulties in the designing and training of end-to-end face recognition models. In this paper we study the possibility of alignment learning in end-to-end face recognition, in which neither prior knowledge on facial landmarks nor artificially defined geometric transformations are required. Specifically, spatial transformer layers are inserted in front of the feature extraction layers in a Convolutional Neural Network (CNN) for face recognition. Only human identity clues are used for driving the neural network to automatically learn the most suitable geometric transformation and the most appropriate facial area for the recognition task. To ensure reproducibility, our model is trained purely on the publicly available CASIA-WebFace dataset, and is tested on the Labeled Face in the Wild (LFW) dataset. We have achieved a verification accuracy of 99.08\% which is comparable to state-of-the-art single model based methods.