CV · Mar 1, 2023 · Code
Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens
Sen Yang, Wen Heng, Gang Liu et al. · tencent-ai
In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and on parameter regression in an iterative error feedback manner. In addition, video-based approaches model the overall change over the image-level features to temporally enhance the single-frame feature, but fail to capture the rotational motion at the joint level, and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode the prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model
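For readers unfamiliar with the error metric quoted above, the following is a minimal sketch of MPJPE (mean per-joint position error), the quantity that PA-MPJPE reports after first rigidly aligning the prediction to the ground truth with a Procrustes fit. The alignment step is omitted here, and the function name and toy coordinates are illustrative assumptions, not taken from the paper's code.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 3D joints.

    PA-MPJPE would additionally apply a Procrustes alignment (scale,
    rotation, translation) to `pred` before this step; that is omitted.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Toy example with two joints (units are arbitrary here).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred, gt)
```

In the toy example the first joint matches exactly and the second is 5 units off, so the mean error is 2.5.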
CV · Aug 15, 2023 · Code
ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection
Jifeng Shen, Yifei Chen, Yue Liu et al.
Effective feature fusion of multispectral images plays a crucial role in multispectral object detection. Previous studies have demonstrated the effectiveness of feature fusion using convolutional neural networks, but these methods are sensitive to image misalignment due to their inherent deficiency in local-range feature interaction, which degrades performance. To address this issue, a novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interaction and capture complementary information across modalities simultaneously. This framework enhances the discriminability of object features through the query-guided cross-attention mechanism, leading to improved performance. However, stacking multiple transformer blocks for feature enhancement incurs a large number of parameters and high spatial complexity. To handle this, inspired by the human process of reviewing knowledge, an iterative interaction mechanism is proposed to share parameters among block-wise multimodal transformers, reducing model complexity and computation cost. The proposed method is general and can be effectively integrated into different detection frameworks and used with different backbones. Experimental results on KAIST, FLIR, and VEDAI datasets show that the proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios. Code will be available at https://github.com/chanchanchan97/ICAFusion.
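The idea of reusing one cross-attention block across iterations, instead of stacking separate blocks, can be sketched roughly as follows. This is a toy, pure-Python illustration under stated assumptions: the class `CrossAttentionBlock`, the single query projection `w_query`, and the plain residual sum are all hypothetical simplifications, not the ICAFusion implementation.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class CrossAttentionBlock:
    """Single-head cross-attention with one (hypothetical) query projection."""

    def __init__(self, w_query):
        self.w_query = w_query  # dim x dim matrix, the block's only parameter

    def __call__(self, queries, context):
        dim = len(queries[0])
        scale = 1.0 / math.sqrt(dim)
        fused = []
        for q in queries:
            # Project the query, then score it against every context token.
            q_proj = [sum(q[i] * self.w_query[i][j] for i in range(dim))
                      for j in range(dim)]
            scores = [scale * sum(a * b for a, b in zip(q_proj, c))
                      for c in context]
            weights = softmax(scores)
            attended = [sum(w * c[d] for w, c in zip(weights, context))
                        for d in range(dim)]
            # Residual connection: keep the query and add the attended context.
            fused.append([qi + ai for qi, ai in zip(q, attended)])
        return fused

def iterative_fusion(block, rgb, thermal, num_iters=3):
    """Apply the SAME block repeatedly, so parameters do not grow with depth."""
    for _ in range(num_iters):
        rgb, thermal = block(rgb, thermal), block(thermal, rgb)
    return rgb, thermal

identity = [[1.0, 0.0], [0.0, 1.0]]
block = CrossAttentionBlock(identity)
rgb_out, thermal_out = iterative_fusion(block, [[1.0, 2.0]], [[3.0, 4.0]], num_iters=1)
```

Because `block` is the only object holding parameters, running more iterations adds computation but no new weights, which is the trade-off the abstract describes.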
CV · Aug 16, 2023 · Code
GPA-3D: Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds
Ziyu Li, Jingming Guo, Tongtong Cao et al.
LiDAR-based 3D detection has made great progress in recent years. However, the performance of 3D detectors is considerably limited when deployed in unseen environments, owing to the severe domain gap problem. Existing domain adaptive 3D detection methods do not adequately consider the problem of the distributional discrepancy in feature space, thereby hindering generalization of detectors across domains. In this work, we propose a novel unsupervised domain adaptive 3D detection framework, namely Geometry-aware Prototype Alignment (GPA-3D), which explicitly leverages the intrinsic geometric relationship from point cloud objects to reduce the feature discrepancy, thus facilitating cross-domain transfer. Specifically, GPA-3D assigns a series of tailored and learnable prototypes to point cloud objects with distinct geometric structures. Each prototype aligns BEV (bird's-eye-view) features derived from corresponding point cloud objects on source and target domains, reducing the distributional discrepancy and achieving better adaptation. The evaluation results obtained on various benchmarks, including Waymo, nuScenes and KITTI, demonstrate the superiority of our GPA-3D over the state-of-the-art approaches for different adaptation scenarios. The MindSpore version code will be publicly available at https://github.com/Liz66666/GPA3D.
CV · Mar 3, 2022
Correlation-Aware Deep Tracking
Fei Xie, Chunyu Wang, Guangting Wang et al.
Robustness and discrimination power are two fundamental requirements in visual object tracking. In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminatively model the tracked targets and distractor objects, hindering them from simultaneously meeting these two requirements. While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross-attention scheme. In contrast to the Siamese-like feature extraction, our network deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it is able to suppress non-target features, resulting in instance-varying feature extraction. The output features of the search image can be directly used for predicting target locations without an extra correlation step. Moreover, our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods. Extensive experiments show that our method achieves state-of-the-art results while running in real time. Our feature networks can also be applied to existing tracking pipelines seamlessly to raise the tracking performance. Code will be available.
CV · Feb 23, 2023
EfficientFace: An Efficient Deep Network with Feature Enhancement for Accurate Face Detection
Guangtao Wang, Jun Li, Zhijian Wu et al.
In recent years, deep convolutional neural networks (CNN) have significantly advanced face detection. In particular, lightweight CNN-based architectures have achieved great success due to their low-complexity structure facilitating real-time detection tasks. However, current lightweight CNN-based face detectors, which trade accuracy for efficiency, have inadequate capability in handling insufficient feature representation, faces with unbalanced aspect ratios, and occlusion. Consequently, their performance lags far behind that of the deep heavy detectors. To achieve efficient face detection without sacrificing accuracy, we design an efficient deep face detector termed EfficientFace in this study, which contains three modules for feature enhancement. First, we design a novel cross-scale feature fusion strategy to facilitate bottom-up information propagation, such that the fusion of low-level and high-level features is further strengthened. This is also conducive to estimating the locations of faces and enhancing the descriptive power of face features. Second, we introduce a Receptive Field Enhancement module to consider faces with various aspect ratios. Third, we add an Attention Mechanism module for improving the representational capability of occluded faces. We have evaluated EfficientFace on four public benchmarks and experimental results demonstrate the appealing performance of our method. In particular, our model achieves 95.1% (Easy), 94.0% (Medium) and 90.1% (Hard) respectively on the validation set of the WIDER Face dataset, which is competitive with heavyweight models at only 1/15 of the computational cost of the state-of-the-art MogFace detector.
CL · Aug 16, 2024 · Code
A Survey on Benchmarks of Multimodal Large Language Models
Jian Li, Weiheng Lu, Hao Fei et al.
Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in various applications such as visual question answering, visual perception, understanding, and reasoning. Over the past few years, significant efforts have been made to examine MLLMs from multiple perspectives. This paper presents a comprehensive review of 200 benchmarks and evaluations for MLLMs, focusing on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities. Finally, we discuss the limitations of the current evaluation methods for MLLMs and explore promising future directions. Our key argument is that evaluation should be regarded as a crucial discipline to better support the development of MLLMs. For more details, please visit our GitHub repository: https://github.com/swordlidev/Evaluation-Multimodal-LLMs-Survey.
CV · Aug 21, 2023
ADNet: Lane Shape Prediction via Anchor Decomposition
Lingyu Xiao, Xiang Li, Sen Yang et al.
In this paper, we revisit the limitations of anchor-based lane detection methods, which have predominantly focused on fixed anchors that stem from the edges of the image, disregarding their versatility and quality. To overcome the inflexibility of anchors, we decompose them into learning the heat map of starting points and their associated directions. This decomposition removes the limitations on the starting point of anchors, making our algorithm adaptable to different lane types in various datasets. To enhance the quality of anchors, we introduce the Large Kernel Attention (LKA) for Feature Pyramid Network (FPN). This significantly increases the receptive field, which is crucial for capturing sufficient context, as lane lines typically run throughout the entire image. We have named our proposed system the Anchor Decomposition Network (ADNet). Additionally, we propose the General Lane IoU (GLIoU) loss, which significantly improves the performance of ADNet in complex scenarios. Experimental results on three widely used lane detection benchmarks, VIL-100, CULane, and TuSimple, demonstrate that our approach outperforms the state-of-the-art methods on VIL-100 and exhibits competitive accuracy on CULane and TuSimple. Code and models will be released on https://github.com/Sephirex-X/ADNet.
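The decomposition of an anchor into a starting point plus a direction can be illustrated with a small sketch. It assumes a per-cell start-point heat map and a per-cell angle map (in degrees) as inputs and decodes every cell above a threshold into a straight-line anchor; the function name, the threshold, and the sampling scheme are hypothetical choices for illustration and are not taken from the ADNet code.

```python
import math

def decode_anchors(heatmap, theta_map, threshold=0.5, num_points=4, step=10.0):
    """Turn (start-point score, direction) maps into straight-line anchors.

    Each cell whose score reaches `threshold` becomes a starting point, and
    `num_points` samples are placed along the ray given by its angle.
    Coordinates are (x, y) = (column, row).
    """
    anchors = []
    for row, (heat_row, theta_row) in enumerate(zip(heatmap, theta_map)):
        for col, (score, theta) in enumerate(zip(heat_row, theta_row)):
            if score >= threshold:
                dx = math.cos(math.radians(theta))
                dy = math.sin(math.radians(theta))
                anchors.append([(col + i * step * dx, row + i * step * dy)
                                for i in range(num_points)])
    return anchors

heatmap = [[0.1, 0.9],
           [0.2, 0.3]]
theta_map = [[0.0, 0.0],
             [0.0, 0.0]]
anchors = decode_anchors(heatmap, theta_map)
```

Since the starting point can be any cell of the heat map, anchors are no longer tied to the image border, which is the flexibility the abstract refers to.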
CV · Sep 26, 2024 · Code
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai, Lingfeng Yang, Yihao Xu et al.
Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because the former paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding. Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks. Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch. This branch only consists of a lightweight MLP, which simplifies the structure and improves reasoning speed. Experiments on six widely used VG datasets, i.e., RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG. Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks. Code and models will be available at https://github.com/Dmmm1997/SimVG.
CV · Apr 13, 2023
[CLS] Token is All You Need for Zero-Shot Semantic Segmentation
Letian Wu, Wenyao Zhang, Tengping Jiang et al.
In this paper, we propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP. First, our study provides a couple of key discoveries: (i) the global tokens (a.k.a. [CLS] tokens in Transformer) of the text branch in CLIP provide a powerful representation of semantic information and (ii) these text-side [CLS] tokens can be regarded as category priors to guide the CLIP visual encoder to pay more attention to the corresponding region of interest. Based on that, we build upon the CLIP model as a backbone, which we extend with a One-Way [CLS] token navigation from the text to the visual branch that enables zero-shot dense prediction, dubbed ClsCLIP. Specifically, we use the [CLS] token output from the text branch, as an auxiliary semantic prompt, to replace the [CLS] token in shallow layers of the ViT-based visual encoder. This one-way navigation embeds such a global category prior earlier and thus promotes semantic segmentation. Furthermore, to better segment tiny objects in ZS3, we enhance ClsCLIP with a local zoom-in strategy, which employs region proposal pre-processing, yielding ClsCLIP+. Extensive experiments demonstrate that our proposed ZS3 method achieves SOTA performance, and it is even comparable with few-shot semantic segmentation methods.
CV · Nov 9, 2025 · Code
SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection
Xin Zuo, Chenyu Qu, Haibo Zhan et al.
Recent multispectral object detection methods have primarily focused on spatial-domain feature fusion based on CNNs or Transformers, while the potential of frequency-domain features remains underexplored. In this work, we propose a novel Spatial and Frequency Feature Reconstruction (SFFR) method, which leverages the spatial-frequency feature representation mechanisms of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in both spatial and frequency domains prior to feature fusion. The core components of SFFR are the proposed Frequency Component Exchange KAN (FCEKAN) module and Multi-Scale Gaussian KAN (MSGKAN) module. The FCEKAN introduces an innovative selective frequency component exchange strategy that effectively enhances the complementarity and consistency of cross-modal features based on the frequency features of RGB and IR images. The MSGKAN module demonstrates excellent nonlinear feature modeling capability in the spatial domain. By leveraging multi-scale Gaussian basis functions, it effectively captures the feature variations caused by scale changes at different UAV flight altitudes, significantly enhancing the model's adaptability and robustness to scale variations. It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively for better feature fusion. Extensive experiments on the SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate the superior performance and significant advantages of the proposed method in the UAV multispectral object perception task. Code will be available at https://github.com/qchenyu1027/SFFR.
CV · Mar 4, 2023
Exploit CAM by itself: Complementary Learning System for Weakly Supervised Semantic Segmentation
Jiren Mai, Fei Zhang, Junjie Ye et al.
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has long suffered from fragmentary object regions caused by the Class Activation Map (CAM), which is incapable of generating fine-grained masks for semantic segmentation. To guide CAM to find more non-discriminating object patterns, this paper turns to an interesting working mechanism in agent learning named Complementary Learning System (CLS). CLS holds that the neocortex builds a sensation of general knowledge, while the hippocampus specially learns specific details, completing the learned patterns. Motivated by this simple but effective learning pattern, we propose a General-Specific Learning Mechanism (GSLM) to explicitly drive a coarse-grained CAM to a fine-grained pseudo mask. Specifically, GSLM develops a General Learning Module (GLM) and a Specific Learning Module (SLM). The GLM is trained with image-level supervision to extract coarse and general localization representations from CAM. Based on the general knowledge in the GLM, the SLM progressively exploits the specific spatial knowledge from the localization representations, expanding the CAM in an explicit way. To this end, we propose Seed Reactivation to help the SLM reactivate non-discriminating regions by setting a boundary for activation values, which successively identifies more regions of CAM. Without extra refinement processes, our method is able to achieve breakthrough improvements for CAM of over 20.0% mIoU on PASCAL VOC 2012 and 10.0% mIoU on MS COCO 2014 datasets, representing a new state-of-the-art among existing WSSS methods.
LG · Oct 31, 2022
Probabilistic Decomposition Transformer for Time Series Forecasting
Junlong Tong, Liping Xie, Wankou Yang et al.
Time series forecasting is crucial for many fields, such as disaster warning, weather prediction, and energy consumption. The Transformer-based models are considered to have revolutionized the field of sequence modeling. However, the complex temporal patterns of the time series hinder the model from mining reliable temporal dependencies. Furthermore, the autoregressive form of the Transformer introduces cumulative errors in the inference step. In this paper, we propose the probabilistic decomposition Transformer model that combines the Transformer with a conditional generative model, which provides hierarchical and interpretable probabilistic forecasts for intricate time series. The Transformer is employed to learn temporal patterns and implement primary probabilistic forecasts, while the conditional generative model is used to achieve non-autoregressive hierarchical probabilistic forecasts by introducing latent space feature representations. In addition, the conditional generative model reconstructs typical features of the series, such as seasonality and trend terms, from probability distributions in the latent space to enable complex pattern separation and provide interpretable forecasts. Extensive experiments on several datasets demonstrate the effectiveness and robustness of the proposed model, indicating that it compares favorably with the state of the art.
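To make the notion of separating trend and seasonality concrete, here is a minimal sketch of a classical additive decomposition based on a moving average. This is a deliberately swapped-in, non-learned technique: the paper reconstructs these terms from a latent space with a conditional generative model, which is not reproduced here. The function names and the toy series are illustrative assumptions.

```python
def moving_average(series, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return trend

def decompose(series, period):
    """Split a series into trend + seasonal + residual (additive model).

    Assumes len(series) >= period so every phase has at least one sample.
    """
    trend = moving_average(series, period)
    detrended = [x - t for x, t in zip(series, trend)]
    # Average the detrended values observed at each phase of the cycle.
    profile = []
    for phase in range(period):
        values = detrended[phase::period]
        profile.append(sum(values) / len(values))
    seasonal = [profile[i % period] for i in range(len(series))]
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual

series = [10.0, 12.0, 10.0, 8.0, 11.0, 13.0, 11.0, 9.0]
trend, seasonal, residual = decompose(series, period=4)
```

By construction the three components sum back to the original series, which is what makes each term individually interpretable.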
CV · Aug 13, 2022 · Code
Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization
Ming Dai, Enhui Zheng, Jiahao Chen et al.
Image retrieval (IR) has emerged as a promising approach for self-localization in unmanned aerial vehicles (UAVs). However, IR-based methods face several challenges: 1) pre- and post-processing incur significant computational and storage overhead; 2) the lack of interaction between dual-source features impairs precise spatial perception. In this paper, we propose an efficient heterogeneous spatial feature interaction method, termed Drone Referring Localization (DRL), which aims to localize UAV-view images within satellite imagery. Unlike conventional methods that treat different data sources in isolation, followed by cosine similarity computations, DRL facilitates the learnable interaction of heterogeneous features. To implement the proposed DRL, we design two transformer-based frameworks, Post-Fusion and Mix-Fusion, enabling end-to-end training and inference. Furthermore, we introduce random scale cropping and weight balance loss techniques to augment paired data and optimize the balance between positive and negative sample weights. Additionally, we construct a new dataset, UL14, and establish a benchmark tailored to the DRL framework. Compared to traditional IR methods, DRL achieves superior localization accuracy (MA@20 +9.4%) while significantly reducing computational time (1/7) and storage overhead (1/3). The dataset and code are available at https://github.com/Dmmm1997/DRL.
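The conventional retrieval baseline that this abstract contrasts against, where features from each source are extracted in isolation and then compared by cosine similarity, can be sketched as follows. The feature vectors are made-up toy values, and `retrieve` is a hypothetical helper name, not part of the DRL code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, gallery):
    """Index of the gallery feature most similar to the query feature."""
    return max(range(len(gallery)),
               key=lambda i: cosine_similarity(query, gallery[i]))

# Toy features: one UAV-view query against three satellite-tile features.
uav_feature = [1.0, 0.0]
satellite_features = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
best_tile = retrieve(uav_feature, satellite_features)
```

Note that the two feature sets never interact before the similarity is computed, which is the limitation the abstract attributes to IR-based methods.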
RO · Sep 24, 2024 · Code
Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving
Lingyu Xiao, Jiang-Jiang Liu, Sen Yang et al.
The autoregressive world model exhibits robust generalization capabilities in vectorized scene understanding but encounters difficulties in deriving actions due to insufficient uncertainty modeling and self-delusion. In this paper, we explore the feasibility of deriving decisions from an autoregressive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses. We propose LatentDriver, a framework that models the environment's next states and the ego vehicle's possible actions as a mixture distribution, from which a deterministic control signal is then derived. By incorporating mixture modeling, the stochastic nature of decision-making is captured. Additionally, the self-delusion problem is mitigated by providing intermediate actions sampled from a distribution to the world model. Experimental results on the recently released closed-loop benchmark Waymax demonstrate that LatentDriver surpasses state-of-the-art reinforcement learning and imitation learning methods, achieving expert-level performance. The code and models will be made available at https://github.com/Sephirex-X/LatentDriver.
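One simple way to picture "a mixture distribution from which a deterministic control signal is derived" is a Gaussian mixture over actions. The sketch below makes two assumptions that the abstract does not confirm: the deterministic action is taken as the mean of the highest-weight component, and intermediate actions are drawn by first sampling a component and then sampling from its Gaussian. All names and numbers are hypothetical.

```python
import random

def most_likely_action(weights, means):
    """Deterministic choice: mean of the component with the largest weight."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return list(means[best])

def sample_action(weights, means, stds, rng):
    """Stochastic choice: pick a component by weight, then sample its Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(mu, sigma) for mu, sigma in zip(means[k], stds[k])]

# Toy mixture over (acceleration, steering) with three hypotheses.
weights = [0.2, 0.7, 0.1]
means = [[0.0, 0.0], [1.0, -0.5], [2.0, 2.0]]
stds = [[0.1, 0.1], [0.2, 0.05], [0.3, 0.3]]

control = most_likely_action(weights, means)
intermediate = sample_action(weights, means, stds, random.Random(0))
```

Keeping several weighted hypotheses preserves the multi-modal nature of driving decisions, while the final control signal remains a single deterministic value.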
CV · Dec 11, 2023 · Code
SSPNet: Scale and Spatial Priors Guided Generalizable and Interpretable Pedestrian Attribute Recognition
Jifeng Shen, Teng Guo, Xin Zuo et al.
Global feature based Pedestrian Attribute Recognition (PAR) models are often poorly localized when using Grad-CAM for attribute response analysis, which has a significant impact on interpretability, generalizability and performance. Previous research has attempted to improve generalization and interpretation through meticulous model design, yet has often neglected or underutilized effective prior information crucial for PAR. To this end, a novel Scale and Spatial Priors Guided Network (SSPNet) is proposed for PAR, which is mainly composed of the Adaptive Feature Scale Selection (AFSS) and Prior Location Extraction (PLE) modules. The AFSS module learns to provide reasonable scale prior information for different attribute groups, allowing the model to focus on different levels of feature maps with varying semantic granularity. The PLE module reveals potential attribute spatial prior information, which avoids unnecessary attention on irrelevant areas and lowers the risk of model over-fitting. More specifically, the scale prior in AFSS is adaptively learned from different layers of the feature pyramid with maximum accuracy, while the spatial priors in PLE can be revealed from part features with different granularity (such as image blocks, human pose keypoints and sparse sampling points). Besides, a novel IoU based attribute localization metric is proposed for Weakly-supervised Pedestrian Attribute Localization (WPAL) based on the improved Grad-CAM for attribute response mask. The experimental results on the intra-dataset and cross-dataset evaluations demonstrate the effectiveness of our proposed method in terms of mean accuracy (mA). Furthermore, it also achieves superior performance on the PCS dataset for attribute localization in terms of IoU. Code will be released at https://github.com/guotengg/SSPNet.
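As background for the IoU-based localization metric mentioned above, the sketch below thresholds a response map into a binary mask and computes plain intersection-over-union against a reference mask. This is only the generic IoU building block; the paper's actual metric and its improved Grad-CAM are not reproduced, and the threshold value and function names are assumptions.

```python
def binarize(response, threshold=0.5):
    """Turn a response map (e.g. a normalized activation map) into a 0/1 mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in response]

def mask_iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return intersection / union if union else 0.0

response = [[0.9, 0.6],
            [0.2, 0.1]]
reference = [[1, 0],
             [1, 0]]
iou = mask_iou(binarize(response), reference)
```

Here the thresholded response covers the top row and the reference covers the left column, so one cell overlaps out of three covered in total.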
CV · Jan 12, 2025 · Code
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Ming Dai, Jian Li, Jiedong Zhuang et al.
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C³VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C³VG, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
CV · Feb 20, 2024 · Code
Object-level Geometric Structure Preserving for Natural Image Stitching
Wenxiao Cai, Wankou Yang
The topic of stitching images with globally natural structures holds paramount significance, with two main goals: pixel-level alignment and distortion prevention. The existing approaches exhibit the ability to align well, yet fall short in maintaining object structures. In this paper, we endeavour to safeguard the overall OBJect-level structures within images based on Global Similarity Prior (OBJ-GSP), on the basis of good alignment performance. Our approach leverages semantic segmentation models like the family of Segment Anything Model to extract the contours of any objects in a scene. Triangular meshes are employed in image transformation to protect the overall shapes of objects within images. The balance between alignment and distortion prevention is achieved by allowing the object meshes to strike a balance between similarity and projective transformation. We also demonstrate that object-level semantic information is necessary in low-altitude aerial image stitching. Additionally, we propose StitchBench, the largest image stitching benchmark with the most diverse scenarios. Extensive experimental results demonstrate that OBJ-GSP outperforms existing methods in both pixel alignment and shape preservation. Code and dataset are publicly available at https://github.com/RussRobin/OBJ-GSP.
CV · Jul 2, 2025 · Code
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Ming Dai, Wenxuan Cheng, Jiang-jiang Liu et al.
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks is still missing. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non- and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability. The code and models are available at https://github.com/Dmmm1997/DeRIS.
CV · Mar 30, 2025 · Code
Improving underwater semantic segmentation with underwater image quality attention and muti-scale aggregation attention
Xin Zuo, Jiaran Jiang, Jifeng Shen et al.
Underwater image understanding is crucial for both submarine navigation and seabed exploration. However, the low illumination in underwater environments degrades the imaging quality, which in turn seriously deteriorates the performance of underwater semantic segmentation, particularly for outlining the object region boundaries. To tackle this issue, we present UnderWater SegFormer (UWSegFormer), a transformer-based framework for semantic segmentation of low-quality underwater images. First, we propose the Underwater Image Quality Attention (UIQA) module. This module enhances the representation of high-quality semantic information in underwater image feature channels through a channel self-attention mechanism. To address the loss of imaging details due to the underwater environment, the Multi-scale Aggregation Attention (MAA) module is proposed. This module aggregates sets of semantic features at different scales by extracting discriminative information from high-level features, thus compensating for the semantic loss of detail in underwater objects. Finally, during training, we introduce Edge Learning Loss (ELL) to enhance the model's learning of underwater object edges and improve the model's prediction accuracy. Experiments conducted on the SUIM and DUT-USEG (DUT) datasets demonstrate that the proposed method has advantages in terms of segmentation completeness, boundary clarity, and subjective perceptual details when compared to SOTA methods. In addition, the proposed method achieves the highest mIoU of 82.12 and 71.41 on the SUIM and DUT datasets, respectively. Code will be available at https://github.com/SAWRJJ/UWSegFormer.
CV · Sep 5, 2025 · Code
PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Ming Dai, Wenxuan Cheng, Jiedong Zhuang et al.
Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at https://github.com/Dmmm1997/PropVG.
CVSep 17, 2025Code
Improving Generalized Visual Grounding with Instance-aware Joint LearningMing Dai, Wenxuan Cheng, Jiang-Jiang Liu et al.
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistent prediction of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments conducted on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.
CVJul 19, 2025Code
Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object DetectionJifeng Shen, Haibo Zhan, Shaohua Dong et al.
Modern multispectral feature fusion for object detection faces two critical limitations: (1) an excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) the trade-off between receptive field size and computational complexity presents a critical bottleneck for scalable feature modeling. To address these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGB-T salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.
CVOct 10, 2025Code
MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel UnderstandingMing Dai, Sen Yang, Boqiang Duan et al.
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlook essential temporal cues, while the latter increase system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
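The dense-near-the-moment, sparse-elsewhere idea behind Moment-Centric Sampling can be sketched as below. This is only an illustration under stated assumptions: the function name, the `radius`, `dense_stride`, and `sparse_stride` parameters, and the choice of a single key frame index are all hypothetical, and in the paper the key moment comes from the model's [FIND] token rather than being supplied by hand.

```python
from typing import List


def moment_centric_sample(
    num_frames: int,
    key_frame: int,
    radius: int = 4,         # assumed half-width of the dense window
    dense_stride: int = 1,   # assumed step inside the key moment
    sparse_stride: int = 8,  # assumed step over the rest of the video
) -> List[int]:
    """Return sorted frame indices: dense around key_frame, sparse elsewhere."""
    dense_start = max(0, key_frame - radius)
    dense_end = min(num_frames - 1, key_frame + radius)
    # Sparse samples keep global context across the whole clip.
    sparse = set(range(0, num_frames, sparse_stride))
    # Dense samples keep motion detail inside the key moment.
    dense = set(range(dense_start, dense_end + 1, dense_stride))
    return sorted(sparse | dense)


frames = moment_centric_sample(num_frames=40, key_frame=20, radius=2)
print(frames)  # [0, 8, 16, 18, 19, 20, 21, 22, 24, 32]
```

With these toy settings, ten of the forty frames are kept, five of them inside the window around frame 20.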
CVSep 11, 2025Code
IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object DetectionJifeng Shen, Haibo Zhan, Xin Zuo et al.
Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on a cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significant performance gains. In extensive experiments on the FLIR, LLVIP and M$^3$FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.
CVJun 10, 2025Code
InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck MambaYuhang Wang, Jun Li, Zhijian Wu et al.
Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it has a limited ability to capture spatial dependencies along different dimensions and fails to fully explore spatial modeling in local neighborhoods. Besides, the inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and an enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at https://github.com/Wake1021/InceptionMamba.
CVJun 19, 2024Code
SpatialBot: Precise Spatial Understanding with Vision Language ModelsWenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan et al.
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding, but they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks and Embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code and data are available at https://github.com/BAAI-DCAI/SpatialBot.
CVMay 23, 2023Code
VDD: Varied Drone Dataset for Semantic SegmentationWenxiao Cai, Ke Jin, Jinyan Hou et al.
Semantic segmentation of drone images is critical for various aerial vision tasks as it provides essential semantic details to understand scenes on the ground. Ensuring high accuracy of semantic segmentation models for drones requires access to diverse, large-scale, and high-resolution datasets, which are often scarce in the field of aerial image processing. While existing datasets typically focus on urban scenes and are relatively small, our Varied Drone Dataset (VDD) addresses these limitations by offering a large-scale, densely labeled collection of 400 high-resolution images spanning 7 classes. This dataset features various scenes in urban, industrial, rural, and natural areas, captured from different camera angles and under diverse lighting conditions. We also make new annotations to UDD and UAVid, integrating them under VDD annotation standards, to create the Integrated Drone Dataset (IDD). We train seven state-of-the-art models on drone datasets as baselines. It's expected that our dataset will generate considerable interest in drone image segmentation and serve as a foundation for other drone vision tasks. Datasets are publicly available at https://github.com/RussRobin/VDD.
CVJan 23, 2022Code
Vision-Based UAV Self-Positioning in Low-Altitude Urban EnvironmentsMing Dai, Enhui Zheng, Zhenhua Feng et al.
Unmanned Aerial Vehicles (UAVs) rely on satellite systems for stable positioning. However, due to limited satellite coverage or communication disruptions, UAVs may lose signals from satellite-based positioning systems. In such situations, vision-based techniques can serve as an alternative, ensuring the self-positioning capability of UAVs. However, most of the existing datasets are developed for the geo-localization tasks of the objects identified by UAVs, rather than the self-positioning task of UAVs. Furthermore, the current UAV datasets use discrete sampling on synthetic data, such as Google Maps, thereby neglecting the crucial aspects of dense sampling and the uncertainties commonly experienced in real-world scenarios. To address these issues, this paper presents a new dataset, DenseUAV, which is the first publicly available dataset designed for the UAV self-positioning task. DenseUAV adopts dense sampling on UAV images obtained in low-altitude urban settings. In total, over 27K UAV-view and satellite-view images of 14 university campuses are collected and annotated, establishing a new benchmark. In terms of model development, we first verify the superiority of Transformers over CNNs in this task. Then, we incorporate metric learning into representation learning to enhance the discriminative capacity of the model and to lessen the modality discrepancy. Besides, to facilitate joint learning from both perspectives, we propose a mutually supervised learning approach. Last, we enhance the Recall@K metric and introduce a new measurement, SDM@K, to evaluate the performance of a trained model from both the retrieval and localization perspectives simultaneously. As a result, the proposed baseline method achieves a remarkable Recall@1 score of 83.05% and an SDM@1 score of 86.24% on DenseUAV. The dataset and code will be made publicly available on https://github.com/Dmmm1997/DenseUAV.
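As background for the Recall@K numbers quoted above, here is a minimal sketch of the conventional retrieval Recall@K. It does not implement the paper's enhanced Recall@K or its SDM@K measure, whose definitions are not given in this abstract; the function name and the toy image identifiers are illustrative assumptions.

```python
from typing import List, Sequence, Set


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one correct gallery item in the top k.

    ranked_ids[i] is the gallery ranking for query i, best match first.
    relevant_ids[i] is the set of correct gallery ids for query i.
    """
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if any(g in relevant for g in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Three hypothetical UAV-view queries ranked against satellite-view tiles.
rankings: List[List[str]] = [
    ["s3", "s1", "s7"],
    ["s2", "s9", "s4"],
    ["s5", "s6", "s8"],
]
truth = [{"s1"}, {"s2"}, {"s0"}]

r1 = recall_at_k(rankings, truth, k=1)
r2 = recall_at_k(rankings, truth, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In this toy case only the second query is answered correctly at rank 1, and the first query is recovered at rank 2, so Recall@1 is 1/3 and Recall@2 is 2/3.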
CVDec 28, 2020Code
TransPose: Keypoint Localization via TransformerSen Yang, Zhibin Quan, Mu Nie et al.
While CNN-based models have made remarkable progress on human pose estimation, what spatial dependencies they capture to localize keypoints remains unclear. In this work, we propose a model called TransPose, which introduces Transformer for human pose estimation. The attention layers built in Transformer enable our model to capture long-range relationships efficiently and also can reveal what dependencies the predicted keypoints rely on. To predict keypoint heatmaps, the last attention layer acts as an aggregator, which collects contributions from image clues and forms maximum positions of keypoints. Such a heatmap-based localization approach via Transformer conforms to the principle of Activation Maximization (Erhan et al., 2009). The revealed dependencies are image-specific and fine-grained, which also can provide evidence of how the model handles special cases, e.g., occlusion. The experiments show that TransPose achieves 75.8 AP and 75.0 AP on COCO validation and test-dev sets, while being more lightweight and faster than mainstream CNN architectures. The TransPose model also transfers very well on MPII benchmark, achieving superior performance on the test set when fine-tuned with small training costs. Code and pre-trained models are publicly available at https://github.com/yangsenius/TransPose.
CVSep 21, 2020Code
Learning Spatio-Appearance Memory Network for High-Performance Visual TrackingFei Xie, Wankou Yang, Bo Liu et al.
Existing visual object tracking usually learns a bounding-box based template to match the targets across frames, which cannot accurately learn a pixel-wise representation and is therefore limited in handling severe appearance variations. To address these issues, much effort has been made on segmentation-based tracking, which learns a pixel-wise object-aware template and can achieve higher accuracy than bounding-box template based tracking. However, existing segmentation-based trackers are ineffective in learning the spatio-temporal correspondence across frames because they do not use the rich temporal information. To overcome this issue, this paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence. Within it, an appearance memory network explores spatio-temporal non-local similarity to learn the dense correspondence between the segmentation mask and the current frame. Meanwhile, a spatial memory network is modeled as a discriminative correlation filter to learn the mapping between the feature map and the spatial map. The appearance memory network helps to filter out the noisy samples in the spatial memory network, while the latter provides the former with a more accurate target geometrical center. This mutual promotion greatly boosts the tracking performance. Without bells and whistles, our simple-yet-effective tracking architecture sets a new state of the art on the VOT2016, VOT2018, VOT2019, GOT-10K, TrackingNet, and VOT2020 benchmarks. Besides, our tracker outperforms the leading segmentation-based trackers SiamMask and D3S on two video object segmentation benchmarks, DAVIS16 and DAVIS17, by a large margin. The source code can be found at https://github.com/phiphiphi31/DMB.
CVSep 16, 2019Code
Pose Neural Fabrics SearchSen Yang, Wankou Yang, Zhen Cui
Neural Architecture Search (NAS) technologies have emerged in many domains to jointly learn the architectures and weights of the neural network. However, most existing NAS works claim they are task-specific and focus only on optimizing a single architecture to replace a human-designed neural network; in fact, their search processes are almost independent of domain knowledge of the tasks. In this paper, we propose Pose Neural Fabrics Search (PoseNFS). We explore a new solution for NAS and the human pose estimation task: part-specific neural architecture search, which can be seen as a variant of multi-task learning. First, we design a new neural architecture search space, Cell-based Neural Fabric (CNF), to learn micro as well as macro neural architecture using a differentiable search strategy. Then, we view locating human keypoints as multiple disentangled prediction sub-tasks and use prior knowledge of body structure as guidance to search for multiple part-specific neural architectures for different human parts. After the search, all these part-specific CNFs have distinct micro and macro architecture parameters. The results show that such knowledge-guided NAS-based architectures yield clear performance improvements over a hand-designed part-based baseline model. The experiments on MPII and MS-COCO datasets demonstrate that PoseNFS can achieve comparable performance to some efficient and state-of-the-art methods. Code is available at https://github.com/yangsenius/PoseNFS.
CVMar 25
UW-VOS: A Large-Scale Dataset for Underwater Video Object SegmentationHongshen Zhao, Jingkang Tai, Yuhang Wu et al.
Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open-air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high-quality training data. To bridge this gap, we introduce UW-VOS, the first large-scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi-automatic data engine with rigorous human verification. We further propose SAM-U, a parameter-efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM-U achieves state-of-the-art performance with only ~2% trainable parameters. Extensive experiments reveal that existing methods experience an average 13-point J&F drop on UW-VOS, while SAM-U effectively bridges this domain gap. Detailed attribute-based analysis further identifies small targets, camouflage, and exit-re-entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.
CVOct 11, 2023
CLIP for Lightweight Semantic SegmentationKe Jin, Wankou Yang
The large-scale pretrained model CLIP, trained on 400 million image-text pairs, offers a promising paradigm for tackling vision tasks, albeit at the image level. Later works, such as DenseCLIP and LSeg, extend this paradigm to dense prediction, including semantic segmentation, and have achieved excellent results. However, these methods either rely on CLIP-pretrained visual backbones or use non-pretrained but heavy backbones such as Swin, and they become ineffective when applied to lightweight backbones. The reason is that lightweight networks, whose feature extraction ability is relatively limited, have difficulty producing image features that align well with text embeddings. In this work, we present a new feature fusion module which tackles this problem and enables the language-guided paradigm to be applied to lightweight networks. Specifically, the module is a parallel design of CNN and transformer with a two-way bridge in between, where the CNN extracts spatial information and visual context of the feature map from the image encoder, and the transformer propagates text embeddings from the text encoder forward. The core of the module is the bidirectional fusion of visual and text features across the bridge, which promotes their proximity and alignment in embedding space. The module is model-agnostic: it not only makes language-guided lightweight semantic segmentation practical, but also fully exploits the pretrained knowledge of language priors and achieves better performance than previous SOTA work, such as DenseCLIP, whatever the vision backbone is. Extensive experiments have been conducted to demonstrate the superiority of our method.
CVNov 26, 2025
EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision TokensZe Feng, Sen Yang, Boqiang Duan et al.
Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some prior works introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance between the student and teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that the EM-KD trained model outperforms prior Efficient MLLMs in both accuracy and efficiency by a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
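The matching-then-distillation steps described above can be sketched on toy data as follows. Several things here are assumptions rather than the authors' implementation: the exhaustive search stands in for the Hungarian algorithm (it reaches the same minimum-cost one-to-one matching but only scales to a handful of tokens), "reverse KL" is taken to mean KL(student || teacher), and all function names and logit values are hypothetical.

```python
import math
from itertools import permutations
from typing import List, Sequence


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two logit vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_tokens(student: Sequence[Sequence[float]],
                 teacher: Sequence[Sequence[float]]) -> List[int]:
    """For each student token, return the index of its matched teacher token.

    Brute-force stand-in for Hungarian matching: tries every one-to-one
    assignment of the (fewer) student tokens to distinct teacher tokens.
    """
    best_cost, best_assign = math.inf, ()
    for assign in permutations(range(len(teacher)), len(student)):
        cost = sum(manhattan(s, teacher[t]) for s, t in zip(student, assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return list(best_assign)


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) over the vocabulary (assumed direction)."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))


# Four teacher tokens, two compressed student tokens, vocabulary of size 3.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
student = [[0.1, 0.0, 1.9], [1.8, 0.2, 0.0]]

assignment = match_tokens(student, teacher)
vsd_loss = sum(reverse_kl(s, teacher[t])
               for s, t in zip(student, assignment)) / len(student)
print(assignment, round(vsd_loss, 6))
```

On this toy input the first student token is matched to teacher token 2 and the second to teacher token 0. For realistic token counts, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the natural replacement for the brute-force search.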
CVJan 23, 2024
Correlation-Embedded Transformer Tracking: A Single-Branch FrameworkFei Xie, Wankou Yang, Chunyu Wang et al.
Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, thereby hindering them from being robust and discriminative simultaneously. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike the Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be directly used for predicting target locations without additional correlation steps. Thus, we reformulate the two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, temporal modeling, and dedicated prediction heads. As a result, SuperSBT outperforms the SBT baseline by 4.7%, 3.0%, and 4.5% in AUC on LaSOT, TrackingNet, and GOT-10K, respectively. Notably, SuperSBT greatly raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks.
CVFeb 17, 2025
Precise GPS-Denied UAV Self-Positioning via Context-Enhanced Cross-View Geo-LocalizationYuanze Xu, Ming Dai, Wenxiao Cai et al.
Image retrieval has been employed as a robust complementary technique to address the challenge of Unmanned Aerial Vehicle (UAV) self-positioning. However, most existing methods primarily focus on localizing objects captured by UAVs through complex part-based representations, often overlooking the unique challenges associated with UAV self-positioning, such as fine-grained spatial discrimination requirements and dynamic scene variations. To address the above issues, we propose the Context-Enhanced method for precise UAV Self-Positioning (CEUSP), specifically designed for UAV self-positioning tasks. CEUSP integrates a Dynamic Sampling Strategy (DSS) to efficiently select optimal negative samples, while the Rubik's Cube Attention (RCA) module, combined with the Context-Aware Channel Integration (CACI) module, enhances feature representation and discrimination by exploiting interdimensional interactions, inspired by the rotational mechanics of a Rubik's Cube. Extensive experiments validate the effectiveness of the proposed method, demonstrating notable improvements in feature representation and UAV self-positioning accuracy within complex urban environments. Our approach achieves state-of-the-art performance on the DenseUAV dataset, which is specifically designed for dense urban contexts, and also delivers competitive results on the widely recognized University-1652 benchmark.
CVDec 24, 2024
Probabilistic Modeling of Disparity Uncertainty for Robust and Efficient Stereo MatchingWenxiao Cai, Dongting Hu, Ruoyan Yin et al.
Stereo matching plays a crucial role in various applications, where understanding uncertainty can enhance both safety and reliability. Despite this, the estimation and analysis of uncertainty in stereo matching have been largely overlooked. Previous works struggle to separate it into data (aleatoric) and model (epistemic) components and often provide limited interpretations of uncertainty. This interpretability is essential, as it allows for a clearer understanding of the underlying sources of error, enhancing both prediction confidence and decision-making processes. In this paper, we propose a new uncertainty-aware stereo matching framework. We adopt Bayes risk as the measurement of uncertainty and use it to separately estimate data and model uncertainty. We systematically analyze data uncertainty based on the probabilistic distribution of disparity and efficiently estimate model uncertainty without repeated model training. Experiments are conducted on four stereo benchmarks, and the results demonstrate that our method can estimate uncertainty accurately and efficiently, without sacrificing the disparity prediction accuracy.
CVJun 4, 2025
Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature ResampleZe Feng, Jiang-Jiang Liu, Sen Yang et al.
In this work, we study the Efficient Multimodal Large Language Model. Redundant vision tokens consume a significant amount of computational memory and resources. Therefore, many previous works compress them in the Vision Projector to reduce the number of vision tokens. However, simply compressing in the Vision Projector can lead to the loss of visual information, especially for tasks that rely on fine-grained spatial relationships, such as OCR and Chart & Table Understanding. To address this problem, we propose Vision Remember, which is inserted between the LLM decoder layers to allow vision tokens to re-memorize vision features. Specifically, we retain multi-level vision features and resample them with the vision tokens that have interacted with the text token. During the resampling process, each vision token only attends to a local region in vision features, which is referred to as saliency-enhancing local attention. Saliency-enhancing local attention not only improves computational efficiency but also captures more fine-grained contextual information and spatial relationships within the region. Comprehensive experiments on multiple visual understanding benchmarks validate the effectiveness of our method when combined with various Efficient Vision Projectors, showing performance gains without sacrificing efficiency. Based on Vision Remember, LLaVA-VR with only 2B parameters is also superior to previous representative MLLMs such as Tokenpacker-HD-7B and DeepSeek-VL-7B.
CVDec 5, 2021
Learning Tracking Representations via Dual-Branch Fully Transformer NetworksFei Xie, Chunyu Wang, Guangting Wang et al.
We present a Siamese-like Dual-branch network based solely on Transformers for tracking. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with others within an attention window. For each token, we estimate whether it contains the target object and the corresponding size. The advantage of the approach is that the features are learned from matching and, ultimately, for matching, so they are aligned with the object tracking task. The method achieves results better than or comparable to those of the best-performing methods, which first use a CNN to extract features and then use a Transformer to fuse them. It outperforms the state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about 40 fps) on one GPU. The code and models will be released.
CVNov 25, 2021
Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware AssociationSen Yang, Zhicheng Wang, Ze Chen et al.
This paper presents a new method to solve keypoint detection and instance association by using Transformer. Bottom-up multi-person pose estimation models need to detect keypoints and learn associative information between them. We argue that these problems can be entirely solved by Transformer. Specifically, the self-attention in Transformer measures dependencies between any pair of locations, which can provide association information for keypoint grouping. However, the naive attention patterns are still not explicitly controlled, so there is no guarantee that the keypoints will always attend to the instances to which they belong. To address this, we propose a novel approach of supervising self-attention for multi-person keypoint detection and instance association. By using instance masks to supervise self-attention to be instance-aware, we can assign the detected keypoints to their corresponding instances based on the pairwise attention scores, without using pre-defined offset vector fields or embeddings as in CNN-based bottom-up models. An additional benefit of our method is that the instance segmentation results of any number of people can be directly obtained from the supervised attention matrix, thereby simplifying the pixel assignment pipeline. The experiments on the COCO multi-person keypoint detection challenge and person instance segmentation task demonstrate the effectiveness and simplicity of the proposed method and show a promising way to control self-attention behavior for specific purposes.
CV · Jul 29, 2021
Video Based Fall Detection Using Human Poses · Ziwei Chen, Yiye Wang, Wankou Yang
Video based fall detection accuracy has been largely improved due to the recent progress on deep convolutional neural networks. However, there still exist some challenges, such as lighting variation and complex backgrounds, which degrade the accuracy and generalization ability of these approaches. Meanwhile, large computation cost limits the application of existing fall detection approaches. To alleviate these problems, a video based fall detection approach using human poses is proposed in this paper. First, a lightweight pose estimator extracts 2D poses from video sequences, and the 2D poses are then lifted to 3D poses. Second, we introduce a robust fall detection network to recognize fall events using the estimated 3D poses, which increases the receptive field and maintains low computation cost by using dilated convolutions. The experimental results show that the proposed fall detection approach achieves a high accuracy of 99.83% on the large action recognition benchmark dataset NTU RGB+D and real-time performance of 18 FPS on a non-GPU platform and 63 FPS on a GPU platform.
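To see why dilated convolutions enlarge the receptive field at no extra parameter cost, the sketch below applies the standard formula for stacked stride-1 1D convolutions: each layer adds `(kernel_size - 1) * dilation` frames. The kernel size and dilation schedule are illustrative assumptions, not the configuration used in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 1D convolutions."""
    rf = 1
    for dilation in dilations:
        rf += (kernel_size - 1) * dilation
    return rf


# Four layers with kernel size 3: same parameter count, different dilation.
plain = receptive_field(3, [1, 1, 1, 1])    # no dilation
dilated = receptive_field(3, [1, 2, 4, 8])  # exponentially growing dilation
print(plain, dilated)  # 9 31
```

With the same number of layers and weights, the dilated stack covers 31 frames instead of 9.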
CV · Jul 7, 2021
SimCC: a Simple Coordinate Classification Perspective for Human Pose Estimation · Yanjie Li, Sen Yang, Peidong Liu et al.
The 2D heatmap-based approaches have dominated Human Pose Estimation (HPE) for years due to their high performance. However, the long-standing quantization error problem in the 2D heatmap-based methods leads to several well-known drawbacks: 1) The performance for low-resolution inputs is limited; 2) To improve the feature map resolution for higher localization precision, multiple costly upsampling layers are required; 3) Extra post-processing is adopted to reduce the quantization error. To address these issues, we explore a brand new scheme, called \textit{SimCC}, which reformulates HPE as two classification tasks for horizontal and vertical coordinates. The proposed SimCC uniformly divides each pixel into several bins, thus achieving \emph{sub-pixel} localization precision and low quantization error. Benefiting from that, SimCC can omit additional refinement post-processing and exclude upsampling layers under certain settings, resulting in a simpler and more effective pipeline for HPE. Extensive experiments conducted over the COCO, CrowdPose, and MPII datasets show that SimCC outperforms heatmap-based counterparts by a large margin, especially in low-resolution settings.
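The effect of dividing each pixel into bins can be illustrated with a minimal sketch for a single axis (the other axis would be handled the same way). The helper names and the `bins_per_pixel` parameter are hypothetical, and the one-hot scores stand in for a classifier output; this is not the authors' implementation.

```python
def encode(coord, bins_per_pixel):
    """Map a continuous pixel coordinate to a class (bin) index."""
    return round(coord * bins_per_pixel)


def decode(scores, bins_per_pixel):
    """Map per-bin classification scores back to a pixel coordinate."""
    best_bin = max(range(len(scores)), key=scores.__getitem__)
    return best_bin / bins_per_pixel


def round_trip(coord, width, bins_per_pixel):
    """Encode a coordinate as a one-hot target, then decode it again."""
    scores = [0.0] * (width * bins_per_pixel)
    scores[encode(coord, bins_per_pixel)] = 1.0
    return decode(scores, bins_per_pixel)


true_x = 17.3
coarse = round_trip(true_x, width=64, bins_per_pixel=1)  # one bin per pixel
fine = round_trip(true_x, width=64, bins_per_pixel=4)    # four bins per pixel
print(coarse, fine)  # 17.0 17.25
```

With `k` bins per pixel the worst-case rounding error is `1 / (2k)` pixels, which is why finer bins reduce quantization error without any upsampling.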
CV · Apr 8, 2021
TokenPose: Learning Keypoint Tokens for Human Pose Estimation · Yanjie Li, Shoukui Zhang, Zhicheng Wang et al.
Human pose estimation deeply relies on visual clues and anatomical constraints between parts to locate keypoints. Most existing CNN-based methods do well in visual representation but lack the ability to explicitly learn the constraint relationships between keypoints. In this paper, we propose a novel approach based on Token representation for human Pose estimation (TokenPose). In detail, each keypoint is explicitly embedded as a token to simultaneously learn constraint relationships and appearance cues from images. Extensive experiments show that the small and large TokenPose models are on par with state-of-the-art CNN-based counterparts while being more lightweight. Specifically, our TokenPose-S and TokenPose-L achieve $72.5$ AP and $75.8$ AP on the COCO validation dataset respectively, with significant reductions in parameters ($\downarrow 80.6\%$; $\downarrow 56.8\%$) and GFLOPs ($\downarrow 75.3\%$; $\downarrow 24.7\%$). Code is publicly available.
CV · Mar 29, 2021
SIENet: Spatial Information Enhancement Network for 3D Object Detection from Point Cloud · Ziyu Li, Yuncong Yao, Zhibin Quan et al.
LiDAR-based 3D object detection has an immense influence on autonomous vehicles. Due to the intrinsic properties of LiDAR, fewer points are collected from objects farther away from the sensor. This imbalanced density of point clouds degrades the detection accuracy but is generally neglected by previous works. To address the challenge, we propose a novel two-stage 3D object detection framework, named SIENet. Specifically, we design the Spatial Information Enhancement (SIE) module to predict the spatial shapes of the foreground points within proposals, and extract the structure information to learn representative features for further box refinement. The predicted spatial shapes are complete and dense point sets, so the extracted structure information contains more semantic representation. Besides, we design the Hybrid-Paradigm Region Proposal Network (HP-RPN), which includes multiple branches to learn discriminative features and generate accurate proposals for the SIE module. Extensive experiments on the KITTI 3D object detection benchmark show that our elaborately designed SIENet outperforms the state-of-the-art methods by a large margin.
CV · Jan 17, 2021
Separable Batch Normalization for Robust Facial Landmark Localization with Cross-protocol Network Training · Shuangping Jin, Zhenhua Feng, Wankou Yang et al.
Large, diverse and balanced training data is the key to the success of deep neural network training. However, existing publicly available datasets used in facial landmark localization are usually much smaller than those for other computer vision tasks. A small dataset without diverse and balanced training samples cannot support the training of a deep network effectively. To address the above issues, this paper presents a novel Separable Batch Normalization (SepBN) module with a Cross-protocol Network Training (CNT) strategy for robust facial landmark localization. Different from the standard BN layer that uses all the training data to calculate a single set of parameters, SepBN considers that the samples of a training dataset may belong to different sub-domains. Accordingly, the proposed SepBN module uses multiple sets of parameters, each corresponding to a specific sub-domain. However, the selection of an appropriate branch in the inference stage remains a challenging task because the sub-domain of a test sample is unknown. To mitigate this difficulty, we propose a novel attention mechanism that assigns different weights to each branch for effective automatic selection. As a further innovation, the proposed CNT strategy trains a network using multiple datasets having different facial landmark annotation systems, boosting the performance and enhancing the generalization capacity of the trained network. The experimental results obtained on several well-known datasets demonstrate the effectiveness of the proposed method.
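A rough way to picture a normalization layer with several parameter sets and attention-weighted branch selection is sketched below. It assumes, beyond what the abstract says, that the branches differ only in their affine parameters `(gamma, beta)` and that the attention weights come from a softmax over per-branch logits. The name `sep_bn` and all values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sep_bn(values, branches, logits, eps=1e-5):
    """Normalize `values`, then apply an attention-weighted mix of affine branches.

    branches: list of (gamma, beta) pairs, one per assumed sub-domain.
    logits: one attention logit per branch (assumed to come from a small predictor).
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    normed = [(v - mean) / math.sqrt(var + eps) for v in values]

    weights = softmax(logits)
    gamma = sum(w * g for w, (g, _) in zip(weights, branches))
    beta = sum(w * b for w, (_, b) in zip(weights, branches))
    return [gamma * x + beta for x in normed]


# Two branches with equal attention: effective gamma = 2.0, beta = 1.0.
out = sep_bn([1.0, 2.0, 3.0], branches=[(1.0, 0.0), (3.0, 2.0)], logits=[0.0, 0.0])
print(out)
```

Because the affine transform is linear, mixing the parameters is equivalent to mixing the branch outputs with the same weights, so no hard branch choice is needed at inference time.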
CV · Sep 16, 2019
Unsupervised Eyeglasses Removal in the Wild · Bingwen Hu, Zhedong Zheng, Ping Liu et al.
Eyeglasses removal is challenging because it requires removing different kinds of eyeglasses, e.g., rimless glasses, full-rim glasses and sunglasses, and recovering appropriate eyes. Due to the large visual variations, the conventional methods lack scalability. Most existing works focus on frontal face images in controlled environments, such as the laboratory, and need to design specific systems for different eyeglass types. To address this limitation, we propose a unified eyeglass removal model called Eyeglasses Removal Generative Adversarial Network (ERGAN), which can handle different types of glasses in the wild. The proposed method does not depend on dense annotation of eyeglasses location but benefits from large-scale face images with weak annotations. Specifically, we study two relevant tasks simultaneously, i.e., removing and wearing eyeglasses. Given two facial images with and without eyeglasses, the proposed model learns to swap the eye area in the two faces. The generation mechanism focuses on the eye area and avoids the difficulty of generating a new face. In the experiment, we show that the proposed method achieves competitive removal quality in terms of realism and diversity. Furthermore, we evaluate ERGAN on several subsequent tasks, such as face verification and facial expression recognition. The experiment shows that our method could serve as a pre-processing method for these tasks.
CV · Jul 26, 2018
Discriminative multi-view Privileged Information learning for image re-ranking · Jun Li, Chang Xu, Wankou Yang et al.
Conventional multi-view re-ranking methods usually perform asymmetrical matching between the region of interest (ROI) in the query image and the whole target image for similarity computation. Due to the inconsistency in visual appearance, this practice tends to degrade the retrieval accuracy, particularly when the image ROI, which is usually interpreted as the image objectness, accounts for a smaller region in the image. Since Privileged Information (PI), which can be viewed as the image prior, characterizes the image objectness well, in this paper we aim to leverage PI to further improve multi-view re-ranking accuracy. Towards this end, we propose a discriminative multi-view re-ranking approach in which both the original global image visual contents and the local auxiliary PI features are simultaneously integrated into a unified training framework for generating latent subspaces with sufficient discriminating power. For on-the-fly re-ranking, since the multi-view PI features are unavailable, we only project the original multi-view image representations onto the latent subspace, and the re-ranking is then achieved by computing and sorting the distances from the multi-view embeddings to the separating hyperplane. Extensive experimental evaluations on the two public benchmarks Oxford5k and Paris6k reveal that our approach provides a further performance boost for accurate image re-ranking, whilst the comparative study demonstrates the advantage of our method over other multi-view re-ranking methods.
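The final re-ranking step, sorting by distance to a separating hyperplane, can be sketched as follows. The hyperplane `(w, b)`, the embeddings, and the convention that a larger signed distance on the positive side means higher relevance are assumptions for illustration, not details taken from the paper.

```python
import math


def signed_distance(x, w, b):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def rerank(embeddings, w, b):
    """Return image ids sorted from most to least relevant (assumed convention)."""
    return sorted(
        embeddings,
        key=lambda image_id: signed_distance(embeddings[image_id], w, b),
        reverse=True,
    )


# Toy 2D latent embeddings and a hand-picked hyperplane (hypothetical values).
w, b = [3.0, 4.0], -5.0
embeddings = {"img_a": [3.0, 4.0], "img_b": [1.0, 1.0], "img_c": [0.0, 0.0]}
ranking = rerank(embeddings, w, b)
print(ranking)  # ['img_a', 'img_b', 'img_c']
```

Here the signed distances are 4.0, 0.4 and -1.0, so the images are returned in that order.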
CV · Mar 16, 2018
Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool · Feng Liu, Tao Xiang, Timothy M. Hospedales et al.
In recent years, visual question answering (VQA) has become topical. The premise of VQA's significance as a benchmark in AI is that both the image and textual question need to be well understood and mutually grounded in order to infer the correct answer. However, current VQA models perhaps 'understand' less than initially hoped, and instead master the easier task of exploiting cues given away in the question and biases in the answer distribution. In this paper we propose the inverse problem of VQA (iVQA). The iVQA task is to generate a question that corresponds to a given image and answer pair. We propose a variational iVQA model that can generate diverse, grammatically correct and content-correlated questions that match the given answer. Based on this model, we show that iVQA is an interesting benchmark for visuo-linguistic understanding, and a more challenging alternative to VQA because an iVQA model needs to understand the image better to be successful. As a second contribution, we show how to use iVQA in a novel reinforcement learning framework to diagnose any existing VQA model by way of exposing its belief set: the set of question-answer pairs that the VQA model would predict true for a given image. This provides a completely new window into what VQA models 'believe' about images. We show that existing VQA models have more erroneous beliefs than previously thought, revealing their intrinsic weaknesses. Suggestions are then made on how to address these weaknesses going forward.
CV · Oct 10, 2017
iVQA: Inverse Visual Question Answering · Feng Liu, Tao Xiang, Timothy M. Hospedales et al.
We propose the inverse problem of visual question answering (iVQA), and explore its suitability as a benchmark for visuo-linguistic understanding. The iVQA task is to generate a question that corresponds to a given image and answer pair. Since the answers are less informative than the questions, and the questions have less learnable bias, an iVQA model needs to understand the image better than a VQA model to be successful. We pose question generation as a multi-modal dynamic inference process and propose an iVQA model that can gradually adjust its focus of attention guided by both a partially generated question and the answer. For evaluation, apart from existing linguistic metrics, we propose a new ranking metric. This metric measures the ground-truth question's rank among a list of distractors, which allows the drawbacks of different algorithms and sources of error to be studied. Experimental results show that our model can generate diverse, grammatically correct and content-correlated questions that match the given answer.
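A ranking metric of this kind can be sketched as below. The sketch assumes that the model assigns a scalar score to each candidate question and that the rank is one plus the number of distractors scored strictly higher than the ground truth; the function name and the scores are hypothetical, and the paper's exact definition may differ.

```python
def ground_truth_rank(gt_score, distractor_scores):
    """Rank of the ground-truth question among distractors (1 = best)."""
    return 1 + sum(1 for score in distractor_scores if score > gt_score)


# Hypothetical model scores for one image-answer pair.
rank = ground_truth_rank(0.62, [0.10, 0.75, 0.40, 0.58])
print(rank)  # 2
```

Inspecting which distractors outrank the ground truth is what makes such a metric useful for studying sources of error.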
MM · Mar 16, 2017
Refining Image Categorization by Exploiting Web Images and General Corpus · Yazhou Yao, Jian Zhang, Fumin Shen et al.
Studies show that refining real-world categories into semantic subcategories contributes to better image modeling and classification. Previous image sub-categorization work relying on labeled images and WordNet's hierarchy is not only labor-intensive, but also restricted to classifying images into NOUN subcategories. To tackle these problems, in this work, we exploit general corpus information to automatically select and subsequently classify web images into semantically rich (sub-)categories. The following two major challenges are studied: 1) noise in the labels of subcategories derived from the general corpus; 2) noise in the labels of images retrieved from the web. Specifically, we first obtain the semantic refinement subcategories from the text perspective and remove the noise with a relevance-based approach. To suppress the noisy images induced by search errors, we then formulate image selection and classifier learning as a multi-class multi-instance learning problem and propose to solve the resulting problem with the cutting-plane algorithm. The experiments show significant performance gains from using the data generated by our approach on both image categorization and sub-categorization tasks. The proposed approach also consistently outperforms existing weakly supervised and web-supervised approaches.