Qiang Fang

CV
h-index7
13papers
746citations
Novelty44%
AI Score45

13 Papers

CVJul 2, 2024Code
Similarity Distance-Based Label Assignment for Tiny Object Detection

Shuohao Shi, Qiang Fang, Tong Zhao et al.

Tiny object detection is becoming one of the most challenging tasks in computer vision because of the limited object size and lack of information. The label assignment strategy is a key factor affecting the accuracy of object detection. Although there are some effective label assignment strategies for tiny objects, most of them focus on reducing the sensitivity to the bounding boxes to increase the number of positive samples and have some fixed hyperparameters need to set. However, more positive samples may not necessarily lead to better detection results, in fact, excessive positive samples may lead to more false positives. In this paper, we introduce a simple but effective strategy named the Similarity Distance (SimD) to evaluate the similarity between bounding boxes. This proposed strategy not only considers both location and shape similarity but also learns hyperparameters adaptively, ensuring that it can adapt to different datasets and various object sizes in a dataset. Our approach can be simply applied in common anchor-based detectors in place of the IoU for label assignment and Non Maximum Suppression (NMS). Extensive experiments on four mainstream tiny object detection datasets demonstrate superior performance of our method, especially, 1.8 AP points and 4.1 AP points of very tiny higher than the state-of-the-art competitors on AI-TOD. Code is available at: \url{https://github.com/cszzshi/SimD}.

CVNov 21, 2023Code
Density-Guided Dense Pseudo Label Selection For Semi-supervised Oriented Object Detection

Tong Zhao, Qiang Fang, Shuohao Shi et al.

Recently, dense pseudo-label, which directly selects pseudo labels from the original output of the teacher model without any complicated post-processing steps, has received considerable attention in semi-supervised object detection (SSOD). However, for the multi-oriented and dense objects that are common in aerial scenes, existing dense pseudo-label selection methods are inefficient because they ignore the significant density difference. Therefore, we propose Density-Guided Dense Pseudo Label Selection (DDPLS) for semi-supervised oriented object detection. In DDPLS, we design a simple but effective adaptive mechanism to guide the selection of dense pseudo labels. Specifically, we propose the Pseudo Density Score (PDS) to estimate the density of potential objects and use this score to select reliable dense pseudo labels. On the DOTA-v1.5 benchmark, the proposed method outperforms previous methods especially when labeled data are scarce. For example, it achieves 49.78 mAP given only 5\% of annotated data, which surpasses previous state-of-the-art method given 10\% of annotated data by 1.15 mAP. Our codes is available at https://github.com/Haru-zt/DDPLS.

LGDec 21, 2025Code
SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models

Pengcheng Li, Qiang Fang, Tong Zhao et al.

Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at https://github.com/positron-lpc/SD2AIL.

SDApr 1, 2022
Residual-guided Personalized Speech Synthesis based on Face Image

Jianrong Wang, Zixuan Wang, Xiaosheng Hu et al.

Previous works derive personalized speech features by training the model on a large dataset composed of his/her audio sounds. It was reported that face information has a strong link with the speech sound. Thus in this work, we innovatively extract personalized speech features from human faces to synthesize personalized speech using neural vocoder. A Face-based Residual Personalized Speech Synthesis Model (FR-PSS) containing a speech encoder, a speech synthesizer and a face encoder is designed for PSS. In this model, by designing two speech priors, a residual-guided strategy is introduced to guide the face feature to approach the true speech feature in the training. Moreover, considering the error of feature's absolute values and their directional bias, we formulate a novel tri-item loss function for face encoder. Experimental results show that the speech synthesized by our model is comparable to the personalized speech synthesized by training a large amount of audio data in previous works.

CVDec 12, 2023Code
Continual Learning through Networks Splitting and Merging with Dreaming-Meta-Weighted Model Fusion

Yi Sun, Xin Xu, Jian Li et al.

It's challenging to balance the networks stability and plasticity in continual learning scenarios, considering stability suffers from the update of model and plasticity benefits from it. Existing works usually focus more on the stability and restrict the learning plasticity of later tasks to avoid catastrophic forgetting of learned knowledge. Differently, we propose a continual learning method named Split2MetaFusion which can achieve better trade-off by employing a two-stage strategy: splitting and meta-weighted fusion. In this strategy, a slow model with better stability, and a fast model with better plasticity are learned sequentially at the splitting stage. Then stability and plasticity are both kept by fusing the two models in an adaptive manner. Towards this end, we design an optimizer named Task-Preferred Null Space Projector(TPNSP) to the slow learning process for narrowing the fusion gap. To achieve better model fusion, we further design a Dreaming-Meta-Weighted fusion policy for better maintaining the old and new knowledge simultaneously, which doesn't require to use the previous datasets. Experimental results and analysis reported in this work demonstrate the superiority of the proposed method for maintaining networks stability and keeping its plasticity. Our code will be released.

CLJul 15, 2021Code
Solving ESL Sentence Completion Questions via Pre-trained Neural Language Models

Qiongqiong Liu, Tianqiao Liu, Jiafu Zhao et al.

Sentence completion (SC) questions present a sentence with one or more blanks that need to be filled in, three to five possible words or phrases as options. SC questions are widely used for students learning English as a Second Language (ESL) and building computational approaches to automatically solve such questions is beneficial to language learners. In this work, we propose a neural framework to solve SC questions in English examinations by utilizing pre-trained language models. We conduct extensive experiments on a real-world K-12 ESL SC question dataset and the results demonstrate the superiority of our model in terms of prediction accuracy. Furthermore, we run precision-recall trade-off analysis to discuss the practical issues when deploying it in real-life scenarios. To encourage reproducible results, we make our code publicly available at \url{https://github.com/AIED2021/ESL-SentenceCompletion}.

MMJun 25, 2021Code
Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

Jianrong Wang, Ziyue Tang, Xuewei Li et al.

Cued Speech (CS) is a visual communication system for the deaf or hearing impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep learning based methods on automatic CS recognition suffer from a common problem, which is the data scarcity. Until now, there are only two public single speaker datasets for French (238 sentences) and British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with teacher-student structure, which transfers audio speech information to CS to overcome the limited data problem. Firstly, we pretrain a teacher model for CS recognition with a large amount of open source audio speech data, and simultaneously pretrain the feature extractors for lips and hands using CS data. Then, we distill the knowledge from teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, for frame-level, we exploit multi-task learning to weigh losses automatically, to obtain the balance coefficient. Besides, we establish a five-speaker British English CS dataset for the first time. The proposed method is evaluated on French and British English CS datasets, showing superior CS recognition performance to the state-of-the-art (SOTA) by a large margin.

CVOct 13, 2020Code
Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

Jianrong Wang, Tong Wu, Shanyu Wang et al.

Lip motion reflects behavior characteristics of speakers, and thus can be used as a new kind of biometrics in speaker recognition. In the literature, lots of works used two-dimensional (2D) lip images to recognize speaker in a textdependent context. However, 2D lip easily suffers from various face orientations. To this end, in this work, we present a novel end-to-end 3D lip motion Network (3LMNet) by utilizing the sentence-level 3D lip motion (S3DLM) to recognize speakers in both the text-independent and text-dependent contexts. A new regional feedback module (RFM) is proposed to obtain attentions in different lip regions. Besides, prior knowledge of lip motion is investigated to complement RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, we present two methods, i.e., coordinate transformation and face posture correction to pre-process the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. The evaluation results on this dataset demonstrate that our proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16 and ResNet-34, and outperforms the state-of-the-art using 2D lip image as well as the 3D face. The code of this work is released at https://github.com/wutong18/Three-Dimensional-Lip- Motion-Network-for-Text-Independent-Speaker-Recognition.

CLOct 13, 2020Code
Mathematical Word Problem Generation from Commonsense Knowledge Graph and Equations

Tianqiao Liu, Qiang Fang, Wenbiao Ding et al.

There is an increasing interest in the use of mathematical word problem (MWP) generation in educational assessment. Different from standard natural question generation, MWP generation needs to maintain the underlying mathematical operations between quantities and variables, while at the same time ensuring the relevance between the output and the given topic. To address above problem, we develop an end-to-end neural model to generate diverse MWPs in real-world scenarios from commonsense knowledge graph and equations. The proposed model (1) learns both representations from edge-enhanced Levi graphs of symbolic equations and commonsense knowledge; (2) automatically fuses equation and commonsense knowledge information via a self-planning module when generating the MWPs. Experiments on an educational gold-standard set and a large-scale generated MWP set show that our approach is superior on the MWP generation task, and it outperforms the SOTA models in terms of both automatic evaluation metrics, i.e., BLEU-4, ROUGE-L, Self-BLEU, and human evaluation metrics, i.e., equation relevance, topic relevance, and language coherence. To encourage reproducible results, we make our code and MWP dataset public available at \url{https://github.com/tal-ai/MaKE_EMNLP2021}.

MMJun 26, 2021
An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech

Jianrong Wang, Nan Gu, Mei Yu et al.

Cued Speech (CS) is a communication system for deaf people or hearing impaired people, in which a speaker uses it to aid a lipreader in phonetic level by clarifying potentially ambiguous mouth movements with hand shape and positions. Feature extraction of multi-modal CS is a key step in CS recognition. Recent supervised deep learning based methods suffer from noisy CS data annotations especially for hand shape modality. In this work, we first propose a self-supervised contrastive learning method to learn the feature representation of image without using labels. Secondly, a small amount of manually annotated CS data are used to fine-tune the first module. Thirdly, we present a module, which combines Bi-LSTM and self-attention networks to further learn sequential features with temporal and contextual information. Besides, to enlarge the volume and the diversity of the current limited CS datasets, we build a new British English dataset containing 5 native CS speakers. Evaluation results on both French and British English datasets show that our model achieves over 90% accuracy in hand shape recognition. Significant improvements of 8.75% (for French) and 10.09% (for British English) are achieved in CS phoneme recognition correctness compared with the state-of-the-art.

ROFeb 24, 2018
Euler angles based loss function for camera relocalization with Deep learning

Qiang Fang, Tianjiang Hu

Deep learning has been applied to camera relocalization, in particular, PoseNet and its extended work are the convolutional neural networks which regress the camera pose from a single image. However there are many problems, one of them is expensive parameter selection. In this paper, we directly explore the three Euler angles as the orientation representation in the camera pose regressor. There is no need to select the parameter, which is not tolerant in the previous works. Experimental results on the 7 Scenes datasets and the King's College dataset demonstrate that it has competitive performances.

RONov 10, 2016
Vision-aided Localization and Navigation Based on Trifocal Tensor

Qiang Fang

In this paper, a novel method for vision-aided navigation based on trifocal tensor is presented. The main goal of the proposed method is to provide position estimation in GPS-denied environments for vehicles equipped with a standard inertial navigation systems(INS) and a single camera only. We treat the trifocal tensor as the measurement model, being only concerned about the vehicle state and do not estimate the the position of the tracked landmarks. The performance of the proposed method is demonstrated using simulation and experimental data.

ROJul 26, 2012
A Unified Approach of Observability Analysis for Airborne SLAM

Qiang Fang, Xinsheng Huang

Observability is a key aspect of the state estimation problem of SLAM, However, the dimension and variables of SLAM system might be changed with new features, to which little attention is paid in the previous work. In this paper, a unified approach of observability analysis for SLAM system is provided, whether the dimension and variables of SLAM system are changed or not, we can use this approach to analyze the local or total observability of the SLAM system.