Tengfei Xing

CV
h-index18
11papers
206citations
Novelty46%
AI Score36

11 Papers

CVApr 9, 2022
S4OD: Semi-Supervised learning for Single-Stage Object Detection

Yueming Zhang, Xingxu Yao, Chao Liu et al.

Single-stage detectors suffer from extreme foreground-background class imbalance, while two-stage detectors do not. Therefore, in semi-supervised object detection, two-stage detectors can deliver remarkable performance by only selecting high-quality pseudo labels based on classification scores. However, directly applying this strategy to single-stage detectors would aggravate the class imbalance with fewer positive samples. Thus, single-stage detectors have to consider both quality and quantity of pseudo labels simultaneously. In this paper, we design a dynamic self-adaptive threshold (DSAT) strategy in classification branch, which can automatically select pseudo labels to achieve an optimal trade-off between quality and quantity. Besides, to assess the regression quality of pseudo labels in single-stage detectors, we propose a module to compute the regression uncertainty of boxes based on Non-Maximum Suppression. By leveraging only 10% labeled data from COCO, our method achieves 35.0% AP on anchor-free detector (FCOS) and 32.9% on anchor-based detector (RetinaNet).

CVApr 23, 2023
CANet: Curved Guide Line Network with Adaptive Decoder for Lane Detection

Zhongyu Yang, Chen Shen, Wei Shao et al.

Lane detection is challenging due to the complicated on road scenarios and line deformation from different camera perspectives. Lots of solutions were proposed, but can not deal with corner lanes well. To address this problem, this paper proposes a new top-down deep learning lane detection approach, CANET. A lane instance is first responded by the heat-map on the U-shaped curved guide line at global semantic level, thus the corresponding features of each lane are aggregated at the response point. Then CANET obtains the heat-map response of the entire lane through conditional convolution, and finally decodes the point set to describe lanes via adaptive decoder. The experimental results show that CANET reaches SOTA in different metrics. Our code will be released soon.

CVAug 29, 2024
Multi-source Domain Adaptation for Panoramic Semantic Segmentation

Jing Jiang, Sicheng Zhao, Jiankun Zhu et al.

Unsupervised domain adaptation methods for panoramic semantic segmentation utilize real pinhole images or low-cost synthetic panoramic images to transfer segmentation models to real panoramic images. However, these methods struggle to understand the panoramic structure using only real pinhole images and lack real-world scene perception with only synthetic panoramic images. Therefore, in this paper, we propose a new task, Multi-source Domain Adaptation for Panoramic Semantic Segmentation (MSDA4PASS), which leverages both real pinhole and synthetic panoramic images to improve segmentation on unlabeled real panoramic images. There are two key issues in the MSDA4PASS task: (1) distortion gaps between the pinhole and panoramic domains -- panoramic images exhibit global and local distortions absent in pinhole images; (2) texture gaps between the source and target domains -- scenes and styles differ across domains. To address these two issues, we propose a novel framework, Deformation Transform Aligner for Panoramic Semantic Segmentation (DTA4PASS), which converts all pinhole images in the source domains into distorted images and aligns the source distorted and panoramic images with the target panoramic images. Specifically, DTA4PASS consists of two main components: Unpaired Semantic Morphing (USM) and Distortion Gating Alignment (DGA). First, in USM, the Dual-view Discriminator (DvD) assists in training the diffeomorphic deformation network at the image and pixel level, enabling the effective deformation transformation of pinhole images without paired panoramic views, alleviating distortion gaps. Second, DGA assigns pinhole-like (pin-like) and panoramic-like (pan-like) features to each image by gating, and aligns these two features through uncertainty estimation, reducing texture gaps.

CVFeb 12, 2020Code
An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

Sicheng Zhao, Yunsheng Ma, Yang Gu et al.

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ traditional two-stage shallow pipeline, i.e. extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e. polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.

CVMar 21, 2024
LDTR: Transformer-based Lane Detection with Anchor-chain Representation

Zhongyu Yang, Chen Shen, Wei Shao et al.

Despite recent advances in lane detection methods, scenarios with limited- or no-visual-clue of lanes due to factors such as lighting conditions and occlusion remain challenging and crucial for automated driving. Moreover, current lane representations require complex post-processing and struggle with specific instances. Inspired by the DETR architecture, we propose LDTR, a transformer-based model to address these issues. Lanes are modeled with a novel anchor-chain, regarding a lane as a whole from the beginning, which enables LDTR to handle special lanes inherently. To enhance lane instance perception, LDTR incorporates a novel multi-referenced deformable attention module to distribute attention around the object. Additionally, LDTR incorporates two line IoU algorithms to improve convergence efficiency and employs a Gaussian heatmap auxiliary branch to enhance model representation capability during training. To evaluate lane detection models, we rely on Frechet distance, parameterized F1-score, and additional synthetic metrics. Experimental results demonstrate that LDTR achieves state-of-the-art performance on well-known datasets.

CVOct 13, 2025
Source-Free Object Detection with Detection Transformer

Huizai Yao, Sicheng Zhao, Shuo Lu et al.

Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs. FRANCK comprises four key components: (1) an Objectness Score-based Sample Reweighting (OSSR) module that computes attention-based objectness scores on multi-scale encoder feature maps, reweighting the detection loss to emphasize less-recognized regions; (2) a Contrastive Learning with Matching-based Memory Bank (CMMB) module that integrates multi-level features into memory banks, enhancing class-wise contrastive learning; (3) an Uncertainty-weighted Query-fused Feature Distillation (UQFD) module that improves feature distillation through prediction quality reweighting and query feature fusion; and (4) an improved self-training pipeline with a Dynamic Teacher Updating Interval (DTUI) that optimizes pseudo-label quality. By leveraging these components, FRANCK effectively adapts a source-pre-trained DETR model to a target domain with enhanced robustness and generalization. Extensive experiments on several widely used benchmarks demonstrate that our method achieves state-of-the-art performance, highlighting its effectiveness and compatibility with DETR-based SFOD models.

MTRL-SCIApr 26, 2025
Predicting Stress in Two-phase Random Materials and Super-Resolution Method for Stress Images by Embedding Physical Information

Tengfei Xing, Xiaodan Ren, Jie Li

Stress analysis is an important part of material design. For materials with complex microstructures, such as two-phase random materials (TRMs), material failure is often accompanied by stress concentration. Phase interfaces in two-phase materials are critical for stress concentration. Therefore, the prediction error of stress at phase boundaries is crucial. In practical engineering, the pixels of the obtained material microstructure images are limited, which limits the resolution of stress images generated by deep learning methods, making it difficult to observe stress concentration regions. Existing Image Super-Resolution (ISR) technologies are all based on data-driven supervised learning. However, stress images have natural physical constraints, which provide new ideas for new ISR technologies. In this study, we constructed a stress prediction framework for TRMs. First, the framework uses a proposed Multiple Compositions U-net (MC U-net) to predict stress in low-resolution material microstructures. By considering the phase interface information of the microstructure, the MC U-net effectively reduces the problem of excessive prediction errors at phase boundaries. Secondly, a Mixed Physics-Informed Neural Network (MPINN) based method for stress ISR (SRPINN) was proposed. By introducing the constraints of physical information, the new method does not require paired stress images for training and can increase the resolution of stress images to any multiple. This enables a multiscale analysis of the stress concentration regions at phase boundaries. Finally, we performed stress analysis on TRMs with different phase volume fractions and loading states through transfer learning. The results show the proposed stress prediction framework has satisfactory accuracy and generalization ability.

LGApr 26, 2025
Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials

Tengfei Xing, Xiaodan Ren, Jie Li

Material stress analysis is a critical aspect of material design and performance optimization. Under dynamic loading, the global stress evolution in materials exhibits complex spatiotemporal characteristics, especially in two-phase random materials (TRMs). Such kind of material failure is often associated with stress concentration, and the phase boundaries are key locations where stress concentration occurs. In practical engineering applications, the spatiotemporal resolution of acquired microstructural data and its dynamic stress evolution is often limited. This poses challenges for deep learning methods in generating high-resolution spatiotemporal stress fields, particularly for accurately capturing stress concentration regions. In this study, we propose a framework for global stress generation and spatiotemporal super-resolution in TRMs under dynamic loading. First, we introduce a diffusion model-based approach, named as Spatiotemporal Stress Diffusion (STS-diffusion), for generating global spatiotemporal stress data. This framework incorporates Space-Time U-Net (STU-net), and we systematically investigate the impact of different attention positions on model accuracy. Next, we develop a physics-informed network for spatiotemporal super-resolution, termed as Spatiotemporal Super-Resolution Physics-Informed Operator (ST-SRPINN). The proposed ST-SRPINN is an unsupervised learning method. The influence of data-driven and physics-informed loss function weights on model accuracy is explored in detail. Benefiting from physics-based constraints, ST-SRPINN requires only low-resolution stress field data during training and can upscale the spatiotemporal resolution of stress fields to arbitrary magnifications.

CVJun 14, 2024
MapVision: CVPR 2024 Autonomous Grand Challenge Mapless Driving Tech Report

Zhongyu Yang, Mai Liu, Jinluo Xie et al.

Autonomous driving without high-definition (HD) maps demands a higher level of active scene understanding. In this competition, the organizers provided the multi-perspective camera images and standard-definition (SD) maps to explore the boundaries of scene reasoning capabilities. We found that most existing algorithms construct Bird's Eye View (BEV) features from these multi-perspective images and use multi-task heads to delineate road centerlines, boundary lines, pedestrian crossings, and other areas. However, these algorithms perform poorly at the far end of roads and struggle when the primary subject in the image is occluded. Therefore, in this competition, we not only used multi-perspective images as input but also incorporated SD maps to address this issue. We employed map encoder pre-training to enhance the network's geometric encoding capabilities and utilized YOLOX to improve traffic element detection precision. Additionally, for area detection, we innovatively introduced LDTR and auxiliary tasks to achieve higher precision. As a result, our final OLUS score is 0.58.

CVOct 27, 2021
3rd Place Solution for VisDA 2021 Challenge -- Universally Domain Adaptive Image Recognition

Haojin Liao, Xiaolin Song, Sicheng Zhao et al.

The Visual Domain Adaptation (VisDA) 2021 Challenge calls for unsupervised domain adaptation (UDA) methods that can deal with both input distribution shift and label set variance between the source and target domains. In this report, we introduce a universal domain adaptation (UniDA) method by aggregating several popular feature extraction and domain adaptation schemes. First, we utilize VOLO, a Transformer-based architecture with state-of-the-art performance in several visual tasks, as the backbone to extract effective feature representations. Second, we modify the open-set classifier of OVANet to recognize the unknown class with competitive accuracy and robustness. As shown in the leaderboard, our proposed UniDA method ranks the 3rd place with 48.49% ACC and 70.8% AUROC in the VisDA 2021 Challenge.

CVJun 16, 2021
2nd Place Solution for Waymo Open Dataset Challenge -- Real-time 2D Object Detection

Yueming Zhang, Xiaolin Song, Bing Bai et al.

In an autonomous driving system, it is essential to recognize vehicles, pedestrians and cyclists from images. Besides the high accuracy of the prediction, the requirement of real-time running brings new challenges for convolutional network models. In this report, we introduce a real-time method to detect the 2D objects from images. We aggregate several popular one-stage object detectors and train the models of variety input strategies independently, to yield better performance for accurate multi-scale detection of each category, especially for small objects. For model acceleration, we leverage TensorRT to optimize the inference time of our detection pipeline. As shown in the leaderboard, our proposed detection framework ranks the 2nd place with 75.00% L1 mAP and 69.72% L2 mAP in the real-time 2D detection track of the Waymo Open Dataset Challenges, while our framework achieves the latency of 45.8ms/frame on an Nvidia Tesla V100 GPU.