CVJun 26, 2024Code
Boosting Few-Shot Open-Set Object Detection via Prompt Learning and Robust Decision BoundaryZhaowei Wu, Binyi Su, Qichuan Geng et al.
Few-shot Open-set Object Detection (FOOD) poses a challenge in many open-world scenarios. It aims to train an open-set detector to detect known objects while rejecting unknowns with scarce training samples. Existing FOOD methods are subject to limited visual information, and often exhibit an ambiguous decision boundary between known and unknown classes. To address these limitations, we propose the first prompt-based few-shot open-set object detection framework, which exploits additional textual information and delves into constructing a robust decision boundary for unknown rejection. Specifically, as no available training data for unknown classes, we select pseudo-unknown samples with Attribution-Gradient based Pseudo-unknown Mining (AGPM), which leverages the discrepancy in attribution gradients to quantify uncertainty. Subsequently, we propose Conditional Evidence Decoupling (CED) to decouple and extract distinct knowledge from selected pseudo-unknown samples by eliminating opposing evidence. This optimization process can enhance the discrimination between known and unknown classes. To further regularize the model and form a robust decision boundary for unknown rejection, we introduce Abnormal Distribution Calibration (ADC) to calibrate the output probability distribution of local abnormal features in pseudo-unknown samples. Our method achieves superior performance over previous state-of-the-art approaches, improving the average recall of unknown class by 7.24% across all shots in VOC10-5-5 dataset settings and 1.38% in VOC-COCO dataset settings. Our source code is available at https://gitee.com/VR_NAVE/ced-food.
CVDec 28, 2025
Spatial-aware Symmetric Alignment for Text-guided Medical Image SegmentationLinglin Liao, Qichuan Geng, Yu Liu
Text-guided Medical Image Segmentation has shown considerable promise for medical image segmentation, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesions description and fail to capture positional constraints, leading to critical deviations. Specifically, with the text "in the left lower lung", the segmentation results may incorrectly cover both sides of the lung. To address the limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity of referring hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.
CVNov 27, 2020
Frequency Domain Image Translation: More Photo-realistic, Better Identity-preservingMu Cai, Hong Zhang, Huijuan Huang et al.
Image-to-image translation has been revolutionized with GAN-based methods. However, existing methods lack the ability to preserve the identity of the source domain. As a result, synthesized images can often over-adapt to the reference domain, losing important structural characteristics and suffering from suboptimal visual quality. To solve these challenges, we propose a novel frequency domain image translation (FDIT) framework, exploiting frequency information for enhancing the image generation process. Our key idea is to decompose the image into low-frequency and high-frequency components, where the high-frequency feature captures object structure akin to the identity. Our training objective facilitates the preservation of frequency information in both pixel space and Fourier spectral space. We broadly evaluate FDIT across five large-scale datasets and multiple tasks including image translation and GAN inversion. Extensive experiments and ablations show that FDIT effectively preserves the identity of the source image, and produces photo-realistic images. FDIT establishes state-of-the-art performance, reducing the average FID score by 5.6% compared to the previous best method.
CVOct 23, 2020
Object-aware Feature Aggregation for Video Object DetectionQichuan Geng, Hong Zhang, Na Jiang et al.
We present an Object-aware Feature Aggregation (OFA) module for video object detection (VID). Our approach is motivated by the intriguing property that video-level object-aware knowledge can be employed as a powerful semantic prior to help object recognition. As a consequence, augmenting features with such prior knowledge can effectively improve the classification and localization performance. To make features get access to more content about the whole video, we first capture the object-aware knowledge of proposals and incorporate such knowledge with the well-established pair-wise contexts. With extensive experimental results on the ImageNet VID dataset, our approach demonstrates the effectiveness of object-aware knowledge with the superior performance of 83.93% and 86.09% mAP with ResNet-101 and ResNeXt-101, respectively. When further equipped with Sequence DIoU NMS, we obtain the best-reported mAP of 85.07% and 86.88% upon the paper submitted. The code to reproduce our results will be released after acceptance.
CVJan 19, 2020
Gated Path Selection Network for Semantic SegmentationQichuan Geng, Hong Zhang, Xiaojuan Qi et al.
Semantic segmentation is a challenging task that needs to handle large scale variations, deformations and different viewpoints. In this paper, we develop a novel network named Gated Path Selection Network (GPSNet), which aims to learn adaptive receptive fields. In GPSNet, we first design a two-dimensional multi-scale network - SuperNet, which densely incorporates features from growing receptive fields. To dynamically select desirable semantic context, a gate prediction module is further introduced. In contrast to previous works that focus on optimizing sample positions on the regular grids, GPSNet can adaptively capture free form dense semantic contexts. The derived adaptive receptive fields are data-dependent, and are flexible that can model different object geometric transformations. On two representative semantic segmentation datasets, i.e., Cityscapes, and ADE20K, we show that the proposed approach consistently outperforms previous methods and achieves competitive performance without bells and whistles.
CVJan 23, 2019
AADS: Augmented Autonomous Driving Simulation using Data-driven AlgorithmsWei Li, Chengwei Pan, Rong Zhang et al.
Simulation systems have become an essential component in the development and validation of autonomous driving technologies. The prevailing state-of-the-art approach for simulation is to use game engines or high-fidelity computer graphics (CG) models to create driving scenarios. However, creating CG models and vehicle movements (e.g., the assets for simulation) remains a manual task that can be costly and time-consuming. In addition, the fidelity of CG images still lacks the richness and authenticity of real-world images and using these images for training leads to degraded performance. In this paper we present a novel approach to address these issues: Augmented Autonomous Driving Simulation (AADS). Our formulation augments real-world pictures with a simulated traffic flow to create photo-realistic simulation images and renderings. More specifically, we use LiDAR and cameras to scan street scenes. From the acquired trajectory data, we generate highly plausible traffic flows for cars and pedestrians and compose them into the background. The composite images can be re-synthesized with different viewpoints and sensor models. The resulting images are photo-realistic, fully annotated, and ready for end-to-end training and testing of autonomous driving systems from perception to planning. We explain our system design and validate our algorithms with a number of autonomous driving tasks from detection to segmentation and predictions. Compared to traditional approaches, our method offers unmatched scalability and realism. Scalability is particularly important for AD simulation and we believe the complexity and diversity of the real world cannot be realistically captured in a virtual environment. Our augmented approach combines the flexibility in a virtual environment (e.g., vehicle movements) with the richness of the real world to allow effective simulation of anywhere on earth.
CVNov 27, 2018
Part-level Car Parsing and Reconstruction from Single Street ViewQichuan Geng, Hong Zhang, Xinyu Huang et al.
Part information has been shown to be resistant to occlusions and viewpoint changes, which is beneficial for various vision-related tasks. However, we found very limited work in car pose estimation and reconstruction from street views leveraging the part information. There are two major contributions in this paper. Firstly, we make the first attempt to build a framework to simultaneously estimate shape, translation, orientation, and semantic parts of cars in 3D space from a single street view. As it is labor-intensive to annotate semantic parts on real street views, we propose a specific approach to implicitly transfer part features from synthesized images to real street views. For pose and shape estimation, we propose a novel network structure that utilizes both part features and 3D losses. Secondly, we are the first to construct a high-quality dataset that contains 348 different car models with physical dimensions and part-level annotations based on global and local deformations. Given these models, we further generate 60K synthesized images with randomization of orientation, illumination, occlusion, and texture. Our results demonstrate that our part segmentation performance is significantly improved after applying our implicit transfer approach. Our network for pose and shape estimation achieves the state-of-the-art performance on the ApolloCar3D dataset and outperforms 3D-RCNN and DeepMANTA by 12.57 and 8.91 percentage points in terms of mean A3DP-Abs.
CVAug 1, 2018
A Network Structure to Explicitly Reduce Confusion Errors in Semantic SegmentationQichuan Geng, Xinyu Huang, Zhong Zhou et al.
Confusing classes that are ubiquitous in real world often degrade performance for many vision related applications like object detection, classification, and segmentation. The confusion errors are not only caused by similar visual patterns but also amplified by various factors during the training of our designed models, such as reduced feature resolution in the encoding process or imbalanced data distributions. A large amount of deep learning based network structures has been proposed in recent years to deal with these individual factors and improve network performance. However, to our knowledge, no existing work in semantic image segmentation is designed to tackle confusion errors explicitly. In this paper, we present a novel and general network structure that reduces confusion errors in more direct manner and apply the network for semantic segmentation. There are two major contributions in our network structure: 1) We ensemble subnets with heterogeneous output spaces based on the discriminative confusing groups. The training for each subnet can distinguish confusing classes within the group without affecting unrelated classes outside the group. 2) We propose an improved cross-entropy loss function that maximizes the probability assigned to the correct class and penalizes the probabilities assigned to the confusing classes at the same time. Our network structure is a general structure and can be easily adapted to any other networks to further reduce confusion errors. Without any changes in the feature encoder and post-processing steps, our experiments demonstrate consistent and significant improvements on different baseline models on Cityscapes and PASCAL VOC datasets (e.g., 3.05% over ResNet-101 and 1.30% over ResNet-38).
CVMar 16, 2018
The ApolloScape Open Dataset for Autonomous Driving and its ApplicationXinyu Huang, Peng Wang, Xinjing Cheng et al.
Autonomous driving has attracted tremendous attention especially in the past few years. The key techniques for a self-driving car include solving tasks like 3D map construction, self-localization, parsing the driving road and understanding objects, which enable vehicles to reason and act. However, large scale data set for training and system evaluation is still a bottleneck for developing robust perception models. In this paper, we present the ApolloScape dataset [1] and its applications for autonomous driving. Compared with existing public datasets from real scenes, e.g. KITTI [2] or Cityscapes [3], ApolloScape contains much large and richer labelling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labelling, lanemark labelling, instance segmentation, 3D car instance, high accurate location for every frame in various driving videos from multiple sites, cities and daytimes. For each task, it contains at lease 15x larger amount of images than SOTA datasets. To label such a complete dataset, we develop various tools and algorithms specified for each task to accelerate the labelling process, such as 3D-2D segment labeling tools, active labelling in videos etc. Depend on ApolloScape, we are able to develop algorithms jointly consider the learning and inference of multiple tasks. In this paper, we provide a sensor fusion scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and a 3D semantic map in order to achieve robust self-localization and semantic segmentation for autonomous driving. We show that practically, sensor fusion and joint learning of multiple tasks are beneficial to achieve a more robust and accurate system. We expect our dataset and proposed relevant algorithms can support and motivate researchers for further development of multi-sensor fusion and multi-task learning in the field of computer vision.