Bo Zhang

CV
h-index23
12papers
1,242citations
Novelty53%
AI Score45

12 Papers

6.8CVSep 15, 2023Code
Rethinking Cross-Domain Pedestrian Detection: A Background-Focused Distribution Alignment Framework for Instance-Free One-Stage Detectors

Yancheng Cai, Bo Zhang, Baopu Li et al. · cambridge, deepmind

Cross-domain pedestrian detection aims to generalize pedestrian detectors from one label-rich domain to another label-scarce domain, which is crucial for various real-world applications. Most recent works focus on domain alignment to train domain-adaptive detectors either at the instance level or image level. From a practical point of view, one-stage detectors are faster. Therefore, we concentrate on designing a cross-domain algorithm for rapid one-stage detectors that lacks instance-level proposals and can only perform image-level feature alignment. However, pure image-level feature alignment causes the foreground-background misalignment issue to arise, i.e., the foreground features in the source domain image are falsely aligned with background features in the target domain image. To address this issue, we systematically analyze the importance of foreground and background in image-level cross-domain alignment, and learn that background plays a more critical role in image-level cross-domain alignment. Therefore, we focus on cross-domain background feature alignment while minimizing the influence of foreground features on the cross-domain alignment stage. This paper proposes a novel framework, namely, background-focused distribution alignment (BFDA), to train domain adaptive onestage pedestrian detectors. Specifically, BFDA first decouples the background features from the whole image feature maps and then aligns them via a novel long-short-range discriminator.

28.7CVMar 7, 2022Code
Adversarial Texture for Fooling Person Detectors in the Physical World

Zhanhao Hu, Siyuan Huang, Xiaopei Zhu et al.

Nowadays, cameras equipped with AI systems can capture and analyze images to detect people automatically. However, the AI system can make mistakes when receiving deliberately designed patterns in the real world, i.e., physical adversarial examples. Prior works have shown that it is possible to print adversarial patches on clothes to evade DNN-based person detectors. However, these adversarial examples could have catastrophic drops in the attack success rate when the viewing angle (i.e., the camera's angle towards the object) changes. To perform a multi-angle attack, we propose Adversarial Texture (AdvTexture). AdvTexture can cover clothes with arbitrary shapes so that people wearing such clothes can hide from person detectors from different viewing angles. We propose a generative method, named Toroidal-Cropping-based Expandable Generative Attack (TC-EGA), to craft AdvTexture with repetitive structures. We printed several pieces of cloth with AdvTexure and then made T-shirts, skirts, and dresses in the physical world. Experiments showed that these clothes could fool person detectors in the physical world.

15.3CVJan 30, 2023Code
Language-Driven Anchors for Zero-Shot Adversarial Robustness

Xiao Li, Wei Zhang, Yining Liu et al.

Deep Neural Networks (DNNs) are known to be susceptible to adversarial attacks. Previous researches mainly focus on improving adversarial robustness in the fully supervised setting, leaving the challenging domain of zero-shot adversarial robustness an open question. In this work, we investigate this domain by leveraging the recent advances in large vision-language models, such as CLIP, to introduce zero-shot adversarial robustness to DNNs. We propose LAAT, a Language-driven, Anchor-based Adversarial Training strategy. LAAT utilizes the features of a text encoder for each category as fixed anchors (normalized feature embeddings) for each category, which are then employed for adversarial training. By leveraging the semantic consistency of the text encoders, LAAT aims to enhance the adversarial robustness of the image model on novel categories. However, naively using text encoders leads to poor results. Through analysis, we identified the issue to be the high cosine similarity between text encoders. We then design an expansion algorithm and an alignment cross-entropy loss to alleviate the problem. Our experimental results demonstrated that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods. LAAT has the potential to enhance adversarial robustness by large-scale multimodal models, especially when labeled data is unavailable during training.

3.6CVDec 4, 2025
Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning

Wentao Wang, Chunyang Liu, Kehua Sheng et al.

The growing exploration of Large Language Models (LLM) and Vision-Language Models (VLM) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on the guidance of control policy and encounter the challenge of limited representations of the backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which can simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows. Semore utilizes VLM with common-sense knowledge to retrieve key information from observations, while using the pre-trained clip to achieve the text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach to simultaneously guide the extraction of semantics and motion, while allowing them to interact spontaneously. Extensive experiments demonstrate that, under the guidance of VLM at the feature level, our method exhibits efficient and adaptive ability compared to state-of-art methods. All codes are released.

17.9LGMay 7, 2020Code
Noisy Differentiable Architecture Search

Xiangxiang Chu, Bo Zhang

Simplicity is the ultimate sophistication. Differentiable Architecture Search (DARTS) has now become one of the mainstream paradigms of neural architecture search. However, it largely suffers from the well-known performance collapse issue due to the aggregation of skip connections. It is thought to have overly benefited from the residual structure which accelerates the information flow. To weaken this impact, we propose to inject unbiased random noise to impede the flow. We name this novel approach NoisyDARTS. In effect, a network optimizer should perceive this difficulty at each training step and refrain from overshooting, especially on skip connections. In the long run, since we add no bias to the gradient in terms of expectation, it is still likely to converge to the right solution area. We also prove that the injected noise plays a role in smoothing the loss landscape, which makes the optimization easier. Our method features extreme simplicity and acts as a new strong baseline. We perform extensive experiments across various search spaces, datasets, and tasks, where we robustly achieve state-of-the-art results. Our code is available at https://github.com/xiaomi-automl/NoisyDARTS.

17.4CVJan 27, 2025
Object Detection for Medical Image Analysis: Insights from the RT-DETR Model

Weijie He, Yuwei Zhang, Ting Xu et al.

Deep learning has emerged as a transformative approach for solving complex pattern recognition and object detection challenges. This paper focuses on the application of a novel detection framework based on the RT-DETR model for analyzing intricate image data, particularly in areas such as diabetic retinopathy detection. Diabetic retinopathy, a leading cause of vision loss globally, requires accurate and efficient image analysis to identify early-stage lesions. The proposed RT-DETR model, built on a Transformer-based architecture, excels at processing high-dimensional and complex visual data with enhanced robustness and accuracy. Comparative evaluations with models such as YOLOv5, YOLOv8, SSD, and DETR demonstrate that RT-DETR achieves superior performance across precision, recall, mAP50, and mAP50-95 metrics, particularly in detecting small-scale objects and densely packed targets. This study underscores the potential of Transformer-based models like RT-DETR for advancing object detection tasks, offering promising applications in medical imaging and beyond.

5.6CVSep 19, 2021Code
Joint Distribution Alignment via Adversarial Learning for Domain Adaptive Object Detection

Bo Zhang, Tao Chen, Bin Wang et al.

Unsupervised domain adaptive object detection aims to adapt a well-trained detector from its original source domain with rich labeled data to a new target domain with unlabeled data. Recently, mainstream approaches perform this task through adversarial learning, yet still suffer from two limitations. First, they mainly align marginal distribution by unsupervised cross-domain feature matching, and ignore each feature's categorical and positional information that can be exploited for conditional alignment; Second, they treat all classes as equally important for transferring cross-domain knowledge and ignore that different classes usually have different transferability. In this paper, we propose a joint adaptive detection framework (JADF) to address the above challenges. First, an end-to-end joint adversarial adaptation framework for object detection is proposed, which aligns both marginal and conditional distributions between domains without introducing any extra hyperparameter. Next, to consider the transferability of each object class, a metric for class-wise transferability assessment is proposed, which is incorporated into the JADF objective for domain adaptation. Further, an extended study from unsupervised domain adaptation (UDA) to unsupervised few-shot domain adaptation (UFDA) is conducted, where only a few unlabeled training images are available in unlabeled target domain. Extensive experiments validate that JADF is effective in both the UDA and UFDA settings, achieving significant performance gains over existing state-of-the-art cross-domain detection methods.

5.6CVAug 30, 2021
Densely Semantic Enhancement for Domain Adaptive Region-free Detectors

Bo Zhang, Tao Chen, Bin Wang et al.

Unsupervised domain adaptive object detection aims to adapt a well-trained detector from its original source domain with rich labeled data to a new target domain with unlabeled data. Previous works focus on improving the domain adaptability of region-based detectors, e.g., Faster-RCNN, through matching cross-domain instance-level features that are explicitly extracted from a region proposal network (RPN). However, this is unsuitable for region-free detectors such as single shot detector (SSD), which perform a dense prediction from all possible locations in an image and do not have the RPN to encode such instance-level features. As a result, they fail to align important image regions and crucial instance-level features between the domains of region-free detectors. In this work, we propose an adversarial module to strengthen the cross-domain matching of instance-level features for region-free detectors. Firstly, to emphasize the important regions of image, the DSEM learns to predict a transferable foreground enhancement mask that can be utilized to suppress the background disturbance in an image. Secondly, considering that region-free detectors recognize objects of different scales using multi-scale feature maps, the DSEM encodes both multi-level semantic representations and multi-instance spatial-contextual relationships across different domains. Finally, the DSEM is pluggable into different region-free detectors, ultimately achieving the densely semantic feature matching via adversarial learning. Extensive experiments have been conducted on PASCAL VOC, Clipart, Comic, Watercolor, and FoggyCityscape benchmarks, and their results well demonstrate that the proposed approach not only improves the domain adaptability of region-free detectors but also outperforms existing domain adaptive region-based detectors under various domain shift settings.

6.5CVAug 30, 2021
Object-aware Long-short-range Spatial Alignment for Few-Shot Fine-Grained Image Classification

Yike Wu, Bo Zhang, Gang Yu et al.

The goal of few-shot fine-grained image classification is to recognize rarely seen fine-grained objects in the query set, given only a few samples of this class in the support set. Previous works focus on learning discriminative image features from a limited number of training samples for distinguishing various fine-grained classes, but ignore one important fact that spatial alignment of the discriminative semantic features between the query image with arbitrary changes and the support image, is also critical for computing the semantic similarity between each support-query pair. In this work, we propose an object-aware long-short-range spatial alignment approach, which is composed of a foreground object feature enhancement (FOE) module, a long-range semantic correspondence (LSC) module and a short-range spatial manipulation (SSM) module. The FOE is developed to weaken background disturbance and encourage higher foreground object response. To address the problem of long-range object feature misalignment between support-query image pairs, the LSC is proposed to learn the transferable long-range semantic correspondence by a designed feature similarity metric. Further, the SSM module is developed to refine the transformed support feature after the long-range step to align short-range misaligned features (or local details) with the query features. Extensive experiments have been conducted on four benchmark datasets, and the results show superior performance over most state-of-the-art methods under both 1-shot and 5-shot classification scenarios.

12.1CVMar 26, 2021
MagDR: Mask-guided Detection and Reconstruction for Defending Deepfakes

Zhikai Chen, Lingxi Xie, Shanmin Pang et al.

Deepfakes raised serious concerns on the authenticity of visual contents. Prior works revealed the possibility to disrupt deepfakes by adding adversarial perturbations to the source data, but we argue that the threat has not been eliminated yet. This paper presents MagDR, a mask-guided detection and reconstruction pipeline for defending deepfakes from adversarial attacks. MagDR starts with a detection module that defines a few criteria to judge the abnormality of the output of deepfakes, and then uses it to guide a learnable reconstruction procedure. Adaptive masks are extracted to capture the change in local facial regions. In experiments, MagDR defends three main tasks of deepfakes, and the learned reconstruction pipeline transfers across input data, showing promising performance in defending both black-box and white-box attacks.

36.9CVJan 26, 2021Code
Prototypical Pseudo Label Denoising and Target Structure Learning for Domain Adaptive Semantic Segmentation

Pan Zhang, Bo Zhang, Ting Zhang et al.

Self-training is a competitive approach in domain adaptive segmentation, which trains the network with the pseudo labels on the target domain. However inevitably, the pseudo labels are noisy and the target features are dispersed due to the discrepancy between source and target domains. In this paper, we rely on representative prototypes, the feature centroids of classes, to address the two issues for unsupervised domain adaptation. In particular, we take one step further and exploit the feature distances from prototypes that provide richer information than mere prototypes. Specifically, we use it to estimate the likelihood of pseudo labels to facilitate online correction in the course of training. Meanwhile, we align the prototypical assignments based on relative feature distances for two different views of the same target, producing a more compact target feature space. Moreover, we find that distilling the already learned knowledge to a self-supervised pretrained model further boosts the performance. Our method shows tremendous performance advantage over state-of-the-art methods. We will make the code publicly available.

29.8CVApr 12, 2020
Cross-domain Correspondence Learning for Exemplar-based Image Translation

Pan Zhang, Bo Zhang, Dong Chen et al.

We present a general framework for exemplar-based image translation, which synthesizes a photo-realistic image from the input in a distinct domain (e.g., semantic segmentation mask, or edge map, or pose keypoints), given an exemplar image. The output has the style (e.g., color, texture) in consistency with the semantically corresponding objects in the exemplar. We propose to jointly learn the crossdomain correspondence and the image translation, where both tasks facilitate each other and thus can be learned with weak supervision. The images from distinct domains are first aligned to an intermediate domain where dense correspondence is established. Then, the network synthesizes images based on the appearance of semantically corresponding patches in the exemplar. We demonstrate the effectiveness of our approach in several image translation tasks. Our method is superior to state-of-the-art methods in terms of image quality significantly, with the image style faithful to the exemplar with semantic consistency. Moreover, we show the utility of our method for several applications