CV May 31
SWARD: Stochastic Window-Attention-Based Relational Distillation for Cross-Architectural Semantic Segmentation
Aditya Makineni, Qing Tian
Large-scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource-constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer-based models that encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi-Scale Windowed Attention Distillation (MWAD) module that aligns teacher-student attention-based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window-boundary bias and, combined with the multi-scale design, captures both short- and long-range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student's feature distribution by enforcing inter-class separation and intra-class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student's reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state-of-the-art performance.
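The stochastically shifted window partition described in this abstract can be illustrated with a minimal sketch. It assumes a cyclic shift and a feature map whose height and width are divisible by the window size; the function name `shifted_window_ids` and these choices are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import Counter


def shifted_window_ids(height, width, window, rng):
    """Assign each (y, x) position a window id after a random cyclic shift.

    Hypothetical helper: assumes height and width are multiples of `window`.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    # Offsets are resampled on every call, i.e. once per training iteration.
    dy = rng.randrange(window)
    dx = rng.randrange(window)
    cols = width // window  # number of windows per row of windows
    ids = [
        [
            (((y + dy) % height) // window) * cols + ((x + dx) % width) // window
            for x in range(width)
        ]
        for y in range(height)
    ]
    return ids, (dy, dx)


rng = random.Random(0)
ids, offset = shifted_window_ids(8, 8, 4, rng)
# Count how many positions fall into each window.
sizes = Counter(i for row in ids for i in row)
```

Because the window boundaries move between iterations, no pair of neighbouring positions is permanently separated by a boundary. Calling the helper with several window sizes would give a multi-scale variant of the same idea.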
LG Oct 3, 2022
Multi-view information fusion using multi-view variational autoencoders to predict proximal femoral strength
Chen Zhao, Joyce H Keyak, Xuewei Cao et al.
The aim of this paper is to design a deep learning-based model to predict proximal femoral strength using multi-view information fusion. Method: We developed new models using multi-view variational autoencoder (MVAE) for feature representation learning and a product of expert (PoE) model for multi-view information fusion. We applied the proposed models to an in-house Louisiana Osteoporosis Study (LOS) cohort with 931 male subjects, including 345 African Americans and 586 Caucasians. With an analytical solution of the product of Gaussian distribution, we adopted variational inference to train the designed MVAE-PoE model to perform common latent feature extraction. We performed genome-wide association studies (GWAS) to select 256 genetic variants with the lowest p-values for each proximal femoral strength and integrated whole genome sequence (WGS) features and DXA-derived imaging features to predict proximal femoral strength. Results: The best prediction model for fall fracture load was acquired by integrating WGS features and DXA-derived imaging features. The designed models achieved the mean absolute percentage error of 18.04%, 6.84% and 7.95% for predicting proximal femoral fracture loads using linear models of fall loading, nonlinear models of fall loading, and nonlinear models of stance loading, respectively. Compared to existing multi-view information fusion methods, the proposed MVAE-PoE achieved the best performance. Conclusion: The proposed models are capable of predicting proximal femoral strength using WGS features and DXA-derived imaging features. Though this tool is not a substitute for FEA using QCT images, it would make improved assessment of hip fracture risk more widely available while avoiding the increased radiation dosage and clinical costs from QCT.
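The analytical product of Gaussian distributions mentioned in this abstract has a standard closed form: precisions add, and the fused mean is the precision-weighted average of the expert means. The sketch below shows it for a single latent dimension. The function name and the optional standard-normal prior expert are assumptions for illustration, not details taken from the paper.

```python
def product_of_gaussian_experts(means, variances, include_prior=True):
    """Fuse per-view Gaussians N(mu_i, var_i) for one latent dimension.

    include_prior adds a standard-normal expert N(0, 1); this is a common
    convention in PoE-based multi-view VAEs and is an assumption here.
    """
    means = list(means)
    variances = list(variances)
    if include_prior:
        means = [0.0] + means
        variances = [1.0] + variances
    precisions = [1.0 / v for v in variances]
    fused_var = 1.0 / sum(precisions)
    fused_mean = fused_var * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_var


# Two equally confident views, no prior: the fused mean is their average
# and the fused variance is halved.
mean, var = product_of_gaussian_experts([1.0, 3.0], [1.0, 1.0], include_prior=False)
```

A view with a smaller variance pulls the fused mean toward its own estimate, and a missing view can simply be left out of the lists.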
CV Mar 7, 2023
TMHOI: Translational Model for Human-Object Interaction Detection
Lijing Zhu, Qizhen Lan, Alvaro Velasquez et al.
Detecting human-object interactions (HOIs) is an intricate challenge in the field of computer vision. Existing methods for HOI detection rely heavily on appearance-based features, but these may not fully capture all the essential characteristics necessary for accurate detection. To overcome these challenges, we propose an innovative graph-based approach called TMGHOI (Translational Model for Human-Object Interaction Detection). Our method effectively captures the sentiment representation of HOIs by integrating both spatial and semantic knowledge. We represent HOIs as a graph in which the interaction components serve as nodes and their spatial relationships as edges. To extract crucial spatial and semantic information, TMGHOI employs separate spatial and semantic encoders. Subsequently, these encodings are combined to construct a knowledge graph that effectively captures the sentiment representation of HOIs. Additionally, the ability to incorporate prior knowledge enhances the understanding of interactions, further boosting detection accuracy. We conducted extensive evaluations on the widely used HICO-DET dataset to demonstrate the effectiveness of TMGHOI. Our approach outperformed existing state-of-the-art graph-based methods by a significant margin, showcasing its potential as a superior solution for HOI detection. We are confident that TMGHOI has the potential to significantly improve the accuracy and efficiency of HOI detection. Its integration of spatial and semantic knowledge, along with its computational efficiency and practicality, makes it a valuable tool for researchers and practitioners in the computer vision community. As with any research, we acknowledge the importance of further exploration and evaluation on various datasets to establish the generalizability and robustness of our proposed method.
CV Dec 12, 2022 Code
Comparison Of Deep Object Detectors On A New Vulnerable Pedestrian Dataset
Devansh Sharma, Tihitina Hade, Qing Tian
Pedestrian safety is a primary concern in autonomous driving. The under-representation of vulnerable groups in today's pedestrian datasets points to an urgent need for a dataset of vulnerable road users. In order to help train comprehensive models and subsequently drive research to improve the accuracy of vulnerable pedestrian identification, we first introduce a new dataset for vulnerable pedestrian detection in this paper: the BG Vulnerable Pedestrian (BGVP) dataset. The dataset includes four classes, i.e., Children Without Disability, Elderly Without Disability, With Disability, and Non-Vulnerable. This dataset consists of images collected from the public domain and manually annotated bounding boxes. In addition, on the proposed dataset, we have trained and tested five classic or state-of-the-art object detection models, i.e., YOLOv4, YOLOv5, YOLOX, Faster R-CNN, and EfficientDet. Our results indicate that YOLOX and YOLOv4 perform the best on our dataset, with YOLOv4 scoring 0.7999 and YOLOX scoring 0.7779 on the mAP 0.5 metric, while YOLOX outperforms YOLOv4 by 3.8 percent on the mAP 0.5:0.95 metric. Generally speaking, all five detectors do well on the With Disability class and perform poorly on the Elderly Without Disability class. YOLOX consistently outperforms all other detectors on the mAP (0.5:0.95) per class metric, obtaining 0.5644, 0.5242, 0.4781, and 0.6796 for Children Without Disability, Elderly Without Disability, Non-Vulnerable, and With Disability, respectively. Our dataset and code are available at https://github.com/devvansh1997/BGVP.
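The mAP 0.5 and mAP 0.5:0.95 metrics quoted above are both built on the intersection-over-union (IoU) between a predicted and a ground-truth box. The sketch below computes IoU for axis-aligned boxes and lists the thresholds of the 0.5:0.95 range, assuming the usual COCO-style step of 0.05. The box format `(x1, y1, x2, y2)` and the helper names are assumptions, not part of the BGVP code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Thresholds averaged over by the 0.5:0.95 metric (assumed step of 0.05).
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))
# A detection counts as a match at a given threshold only if its IoU reaches it.
is_match_at_05 = overlap >= 0.5
```

Averaging over the stricter thresholds rewards tighter localization, so two detectors can rank differently under the two metrics.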
CV Mar 7, 2023
Gradient-Guided Knowledge Distillation for Object Detectors
Qizhen Lan, Qing Tian
Deep learning models have demonstrated remarkable success in object detection, yet their complexity and computational intensity pose a barrier to deploying them in real-world applications (e.g., self-driving perception). Knowledge Distillation (KD) is an effective way to derive efficient models. However, only a small number of KD methods tackle object detection. Also, most of them focus on mimicking the plain features of the teacher model but rarely consider how the features contribute to the final detection. In this paper, we propose a novel approach for knowledge distillation in object detection, named Gradient-guided Knowledge Distillation (GKD). Our GKD uses gradient information to identify and assign more weights to features that significantly impact the detection loss, allowing the student to learn the most relevant features from the teacher. Furthermore, we present bounding-box-aware multi-grained feature imitation (BMFI) to further improve the KD performance. Experiments on the KITTI and COCO-Traffic datasets demonstrate our method's efficacy in knowledge distillation for object detection. On one-stage and two-stage detectors, our GKD-BMFI leads to an average of 5.1% and 3.8% mAP improvement, respectively, beating various state-of-the-art KD methods.
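The idea of weighting feature imitation by gradient information can be sketched as follows. This is an illustrative assumption about one possible weighting scheme (weights proportional to the absolute gradient of the detection loss with respect to each feature, normalized to mean 1), not the GKD formulation itself. All names are hypothetical, and the gradients are supplied by the caller.

```python
def gradient_guided_loss(student, teacher, grads):
    """Weighted squared-error imitation loss over flat feature lists.

    grads holds d(detection loss)/d(feature) for each feature position;
    features with larger gradient magnitude receive larger weights.
    """
    n = len(grads)
    magnitudes = [abs(g) for g in grads]
    total = sum(magnitudes)
    if total == 0:
        weights = [1.0] * n  # fall back to plain feature mimicry
    else:
        weights = [n * m / total for m in magnitudes]  # mean weight is 1
    loss = sum(w * (s - t) ** 2 for w, s, t in zip(weights, student, teacher)) / n
    return loss, weights


loss, weights = gradient_guided_loss(
    student=[1.0, 1.0, 1.0],
    teacher=[0.0, 0.0, 0.0],
    grads=[0.0, 1.0, 3.0],
)
```

With uniform gradients the loss reduces to an ordinary mean squared error, which is the plain feature-mimicry baseline the abstract contrasts against.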
CV Nov 21, 2022
Enhancing Accuracy and Robustness of Steering Angle Prediction with Attention Mechanism
Swetha Nadella, Pramiti Barua, Jeremy C. Hagler et al.
In this paper, our focus is on enhancing steering angle prediction for autonomous driving tasks. We initiate our exploration by investigating two veins of widely adopted deep neural architectures, namely ResNets and InceptionNets. Within both families, we systematically evaluate various model sizes to understand their impact on performance. Notably, our key contribution lies in the incorporation of an attention mechanism to augment steering angle prediction accuracy and robustness. By introducing attention, our models gain the ability to selectively focus on crucial regions within the input data, leading to improved predictive outcomes. Our findings showcase that our attention-enhanced models not only achieve state-of-the-art results in terms of steering angle Mean Squared Error (MSE) but also exhibit enhanced adversarial robustness, addressing critical concerns in real-world deployment. For example, in our experiments on the Kaggle SAP and our created publicly available datasets, attention can lead to over 6% error reduction in steering angle prediction and boost model robustness by up to 56.09%.
CV Mar 4, 2023
Visual Saliency-Guided Channel Pruning for Deep Visual Detectors in Autonomous Driving
Jung Im Choi, Qing Tian
Deep neural network (DNN) pruning has become a de facto component of deployment on resource-constrained devices since it can reduce memory requirements and computation costs during inference. In particular, channel pruning has gained popularity due to its structured nature and direct savings on general hardware. However, most existing pruning approaches utilize importance measures that are not directly related to the task utility. Moreover, few works in the literature focus on visual detection models. To fill these gaps, we propose a novel gradient-based saliency measure for visual detection and use it to guide our channel pruning. Experiments on the KITTI and COCO traffic datasets demonstrate our pruning method's efficacy and superiority over state-of-the-art competing approaches. It can even achieve better performance with fewer parameters than the original model. Our pruning also demonstrates great potential in handling small-scale objects.
CV May 12, 2025 Code
Topology-Guided Knowledge Distillation for Efficient Point Cloud Processing
Luu Tung Hai, Thinh D. Le, Zhicheng Ding et al.
Point cloud processing has gained significant attention due to its critical role in applications such as autonomous driving and 3D object recognition. However, deploying high-performance models like Point Transformer V3 in resource-constrained environments remains challenging due to their high computational and memory demands. This work introduces a novel distillation framework that leverages topology-aware representations and gradient-guided knowledge distillation to effectively transfer knowledge from a high-capacity teacher to a lightweight student model. Our approach captures the underlying geometric structures of point clouds while selectively guiding the student model's learning process through gradient-based feature alignment. Experimental results on the NuScenes, SemanticKITTI, and Waymo datasets demonstrate that the proposed method achieves competitive performance, with an approximately 16x reduction in model size and a nearly 1.9x decrease in inference time compared to its teacher model. Notably, on NuScenes, our method achieves state-of-the-art performance among knowledge distillation techniques trained solely on LiDAR data, surpassing prior knowledge distillation baselines in segmentation performance. Our implementation is publicly available at: https://github.com/HySonLab/PointDistill
CL Jun 9, 2025 Code
ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding
Lijing Zhu, Qizhen Lan, Qing Tian et al.
Continual Knowledge Graph Embedding (CKGE) seeks to integrate new knowledge while preserving past information. However, existing methods struggle with efficiency and scalability due to two key limitations: (1) suboptimal knowledge preservation between snapshots caused by manually designed node/relation importance scores that ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for node/relation importance calculation, leading to slow training and high memory overhead. To address these limitations, we introduce ETT-CKGE (Efficient, Task-driven, Tokens for Continual Knowledge Graph Embedding), a novel task-guided CKGE method that leverages efficient task-driven tokens for efficient and effective knowledge transfer between snapshots. Our method introduces a set of learnable tokens that directly capture task-relevant signals, eliminating the need for explicit node scoring or traversal. These tokens serve as consistent and reusable guidance across snapshots, enabling efficient token-masked embedding alignment between snapshots. Importantly, knowledge transfer is achieved through simple matrix operations, significantly reducing training time and memory usage. Extensive experiments across six benchmark datasets demonstrate that ETT-CKGE consistently achieves superior or competitive predictive performance, while substantially improving training efficiency and scalability compared to state-of-the-art CKGE methods. The code is available at: https://github.com/lijingzhu1/ETT-CKGE/tree/main
CV Mar 27
Learnable Instance Attention Filtering for Adaptive Detector Distillation
Chen Liu, Qizhen Lan, Zhicheng Ding et al.
As deep vision models grow increasingly complex to achieve higher performance, deployment efficiency has become a critical concern. Knowledge distillation (KD) mitigates this issue by transferring knowledge from large teacher models to compact student models. While many feature-based KD methods rely on spatial filtering to guide distillation, they typically treat all object instances uniformly, ignoring instance-level variability. Moreover, existing attention filtering mechanisms are typically heuristic or teacher-driven, rather than learned with the student. To address these limitations, we propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), a novel framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. Notably, the student contributes to this process based on its evolving learning state. Experiments on the KITTI and COCO datasets demonstrate consistent improvements, with a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.
CV Apr 30
AIDA-ReID: Adaptive Intermediate Domain Adaptation for Generalizable and Source-Free Person Re-Identification
Sundas Iqbal, Qing Tian, Danish Ali et al.
Person re-identification (Re-ID) aims to match images of the same individual across non-overlapping camera views and remains challenging due to domain shifts caused by variations in illumination, background, camera characteristics, and population distributions. Although supervised models perform well under matched training and testing conditions, their performance degrades significantly when deployed in unseen environments. Existing intermediate domain approaches such as IDM and IDM++ alleviate this gap by constructing bridge feature distributions between domains; however, they rely on fixed mixing strategies and joint source-target access, limiting their applicability to multi-source and source-free settings. To address these limitations, this paper proposes Adaptive Intermediate Domain Adaptation (AIDA), also referred to as Source-Free Multi-Source Intermediate Domain Adaptation (SF-MIDA). The proposed framework treats intermediate-domain learning as a dynamically regulated process, where feature mixing and regularization strength are adaptively controlled using feedback signals derived from model uncertainty and training stability. A multi-source intermediate domain generator synthesizes diverse intermediate representations, while a pseudo-mirror regularization strategy preserves identity consistency under domain perturbations. Extensive experiments across domain generalization and source-free settings demonstrate the effectiveness of the proposed framework.
CV May 20, 2024
Multi-dimension Transformer with Attention-based Filtering for Medical Image Segmentation
Wentao Wang, Xi Xiao, Mingjie Liu et al.
The accurate segmentation of medical images is crucial for diagnosing and treating diseases. Recent studies demonstrate that vision transformer-based methods have significantly improved performance in medical image segmentation, primarily due to their superior ability to establish global relationships among features and adaptability to various inputs. However, these methods struggle with the low signal-to-noise ratio inherent to medical images. Additionally, the effective utilization of channel and spatial information, which are essential for medical image segmentation, is limited by the representation capacity of self-attention. To address these challenges, we propose a multi-dimension transformer with attention-based filtering (MDT-AF), which redesigns the patch embedding and self-attention mechanism for medical image segmentation. MDT-AF incorporates an attention-based feature filtering mechanism into the patch embedding blocks and employs a coarse-to-fine process to mitigate the impact of low signal-to-noise ratio. To better capture complex structures in medical images, MDT-AF extends the self-attention mechanism to incorporate spatial and channel dimensions, enriching feature representation. Moreover, we introduce an interaction mechanism to improve the feature aggregation between spatial and channel dimensions. Experimental results on three public medical image segmentation benchmarks show that MDT-AF achieves state-of-the-art (SOTA) performance.
CV Mar 8, 2025
ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation
Qizhen Lan, Qing Tian
Dense visual prediction tasks, such as detection and segmentation, are crucial for time-critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature-based KD methods rely on static, teacher-driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student-teacher interactions. To address these limitations, we propose Adaptive student-teacher Cooperative Attention Masking for Knowledge Distillation (ACAM-KD), which introduces two key components: (1) Student-Teacher Cross-Attention Feature Fusion (STCA-FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial-Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel-wise feature selection. Unlike conventional KD methods, ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM-KD improves object detection performance by up to 1.4 mAP over the state-of-the-art when distilling a ResNet-50 student from a ResNet-101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3-MobileNetV2 as the student model.
CV Sep 22, 2025
Visual Detector Compression via Location-Aware Discriminant Analysis
Qizhen Lan, Jung Im Choi, Qing Tian
Deep neural networks are powerful, yet their high complexity greatly limits their potential to be deployed on billions of resource-constrained edge devices. Pruning is a crucial network compression technique, yet most existing methods focus on classification models, with limited attention to detection. Even among those addressing detection, there is a lack of utilization of essential localization information. Also, many pruning methods passively rely on pre-trained models, in which useful and useless components are intertwined, making it difficult to remove the latter without harming the former at the neuron/filter level. To address the above issues, in this paper, we propose a proactive detection-discriminants-based network compression approach for deep visual detectors, which alternates between two steps: (1) maximizing and compressing detection-related discriminants and aligning them with a subset of neurons/filters immediately before the detection head, and (2) tracing the detection-related discriminating power across the layers and discarding features of lower importance. Object location information is exploited in both steps. Extensive experiments, employing four advanced detection models and four state-of-the-art competing methods on the KITTI and COCO datasets, highlight the superiority of our approach. Remarkably, our compressed models can even beat the original base models with a substantial reduction in complexity.
CV Feb 15, 2025
CLoCKDistill: Consistent Location-and-Context-aware Knowledge Distillation for DETRs
Qizhen Lan, Qing Tian
Object detection has advanced significantly with Detection Transformers (DETRs). However, these models are computationally demanding, posing challenges for deployment in resource-constrained environments (e.g., self-driving cars). Knowledge distillation (KD) is an effective compression method widely applied to CNN detectors, but its application to DETR models has been limited. Most KD methods for DETRs fail to distill transformer-specific global context. Also, they blindly believe in the teacher model, which can sometimes be misleading. To bridge the gaps, this paper proposes Consistent Location-and-Context-aware Knowledge Distillation (CLoCKDistill) for DETR detectors, which includes both feature distillation and logit distillation components. For feature distillation, instead of distilling backbone features like existing KD methods, we distill the transformer encoder output (i.e., memory) that contains valuable global context and long-range dependencies. Also, we enrich this memory with object location details during feature distillation so that the student model can prioritize relevant regions while effectively capturing the global context. To facilitate logit distillation, we create target-aware queries based on the ground truth, allowing both the student and teacher decoders to attend to consistent and accurate parts of encoder memory. Experiments on the KITTI and COCO datasets show our CLoCKDistill method's efficacy across various DETRs, e.g., single-scale DAB-DETR, multi-scale deformable DETR, and denoising-based DINO. Our method boosts student detector performance by 2.2% to 6.4%.
LG Feb 11, 2025
PFedDST: Personalized Federated Learning with Decentralized Selection Training
Mengchen Fan, Keren Li, Tianyun Zhang et al.
Distributed Learning (DL) enables the training of machine learning models across multiple devices, yet it faces challenges like non-IID data distributions and device capability disparities, which can impede training efficiency. Communication bottlenecks further complicate traditional Federated Learning (FL) setups. To mitigate these issues, we introduce the Personalized Federated Learning with Decentralized Selection Training (PFedDST) framework. PFedDST enhances model training by allowing devices to strategically evaluate and select peers based on a comprehensive communication score. This score integrates loss, task similarity, and selection frequency, ensuring optimal peer connections. This selection strategy is tailored to increase local personalization and promote beneficial peer collaborations to strengthen the stability and efficiency of the training process. Our experiments demonstrate that PFedDST not only enhances model accuracy but also accelerates convergence. This approach outperforms state-of-the-art methods in handling data heterogeneity, delivering both faster and more effective training in diverse and decentralized systems.
GN Mar 31
GenoBERT: A Language Model for Accurate Genotype Imputation
Lei Huang, Chuan Qiu, Kuan-Jui Su et al.
Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets including the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP) across ancestry groups and multiple genotype missingness levels (5-50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high overall imputation accuracy ($r^2 \approx 0.98$) across datasets, and maintains robust performance ($r^2 > 0.90$) even at 50% missingness. Experimental results across different ancestries confirm consistent gains across datasets, with resilience to small sample sizes and weak LD. A 128-SNP (single-nucleotide polymorphism) context window (approximately 100 Kb) is validated through LD-decay analyses as sufficient to capture local correlation structures. By eliminating reference-panel dependence while preserving high accuracy, GenoBERT provides a scalable and robust solution for genotype imputation and a foundation for downstream genomic modeling.
SD Aug 28, 2025
Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification
Aditya Makineni, Baocheng Geng, Qing Tian
Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context, preserving harmonic structure and significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. When applied to both AST and AuM, our patching method with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
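The effect of patch shape on patch count can be checked with simple arithmetic. The sketch below assumes non-overlapping patches and a hypothetical 128 x 1024 (frequency x time) spectrogram. The sizes are illustrative only and are not meant to reproduce the paper's configuration or its reported 83.26% reduction.

```python
def patch_count(n_freq, n_time, patch_freq, patch_time):
    """Number of non-overlapping patches tiling an n_freq x n_time spectrogram."""
    if n_freq % patch_freq or n_time % patch_time:
        raise ValueError("patch size must divide the spectrogram size")
    return (n_freq // patch_freq) * (n_time // patch_time)


# Square 16 x 16 patches, as borrowed from image models.
square = patch_count(128, 1024, 16, 16)
# Patches spanning all 128 frequency bins with a short temporal extent of 8 frames.
full_frequency = patch_count(128, 1024, 128, 8)
# Fractional reduction in sequence length under these assumed sizes.
reduction = 1 - full_frequency / square
```

Under these assumed sizes the sequence shrinks from 512 to 128 patches. Shorter sequences are the source of the computational savings the abstract describes, particularly for self-attention, whose cost grows quadratically with sequence length.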
CV Feb 10, 2022
Adversarial Attack and Defense of YOLO Detectors in Autonomous Driving Scenarios
Jung Im Choi, Qing Tian
Visual detection is a key task in autonomous driving, and it serves as a crucial foundation for self-driving planning and control. Deep neural networks have achieved promising results in various visual tasks, but they are known to be vulnerable to adversarial attacks. A comprehensive understanding of deep visual detectors' vulnerability is required before people can improve their robustness. However, only a few adversarial attack/defense works have focused on object detection, and most of them employed only classification and/or localization losses, ignoring the objectness aspect. In this paper, we identify a serious objectness-related adversarial vulnerability in YOLO detectors and present an effective attack strategy targeting the objectness aspect of visual detection in autonomous vehicles. Furthermore, to address such vulnerability, we propose a new objectness-aware adversarial training approach for visual detection. Experiments show that the proposed attack targeting the objectness aspect is 45.17% and 43.50% more effective than those generated from classification and/or localization losses on the KITTI and COCO traffic datasets, respectively. Also, the proposed adversarial defense approach can improve the detectors' robustness against objectness-oriented attacks by up to 21% and 12% mAP on KITTI and COCO traffic, respectively.
CV Jan 26, 2022
Adaptive Instance Distillation for Object Detection in Autonomous Driving
Qizhen Lan, Qing Tian
In recent years, knowledge distillation (KD) has been widely used to derive efficient models. Through imitating a large teacher model, a lightweight student model can achieve comparable performance with more efficiency. However, most existing knowledge distillation methods are focused on classification tasks. Only a limited number of studies have applied knowledge distillation to object detection, especially in time-sensitive autonomous driving scenarios. In this paper, we propose Adaptive Instance Distillation (AID) to selectively impart teacher's knowledge to the student to improve the performance of knowledge distillation. Unlike previous KD methods that treat all instances equally, our AID can attentively adjust the distillation weights of instances based on the teacher model's prediction loss. We verified the effectiveness of our AID method through experiments on the KITTI and the COCO traffic datasets. The results show that our method improves the performance of state-of-the-art attention-guided and non-local distillation methods and achieves better distillation results on both single-stage and two-stage detectors. Compared to the baseline, our AID led to an average of 2.7% and 2.1% mAP increases for single-stage and two-stage detectors, respectively. Furthermore, our AID is also shown to be useful for self-distillation to improve the teacher model's performance.
CV Jan 17, 2021
Improving Apparel Detection with Category Grouping and Multi-grained Branches
Qing Tian, Sampath Chanda, K C Amit Kumar et al.
Training an accurate object detector is expensive and time-consuming. One main reason lies in the laborious labeling process, i.e., annotating category and bounding box information for all instances in every image. In this paper, we examine ways to improve the performance of deep object detectors without extra labeling. We first explore grouping existing categories of high visual and semantic similarity together as one super category (or superclass). Then, we study how this knowledge of hierarchical categories can be exploited to better detect objects using multi-grained RCNN top branches. Experimental results on DeepFashion2 and OpenImagesV4-Clothing reveal that the proposed detection heads with multi-grained branches can boost the overall performance by 2.3 mAP for DeepFashion2 and 2.5 mAP for OpenImagesV4-Clothing with no additional time-consuming annotations. More importantly, classes that have fewer training samples tend to benefit more from the proposed multi-grained heads with superclass grouping. In particular, we improve the mAP for the last 30% of categories (in terms of training sample number) by 2.6 and 4.6 for DeepFashion2 and OpenImagesV4-Clothing, respectively.
CV Sep 29, 2020
Grow-Push-Prune: aligning deep discriminants for effective structural network compression
Qing Tian, Tal Arbel, James J. Clark
Most of today's popular deep architectures are hand-engineered to be generalists. However, this design procedure usually leads to a large number of redundant, useless, or even harmful features for specific tasks. Unnecessarily high complexity renders deep nets impractical for many real-world applications, especially those without powerful GPU support. In this paper, we attempt to derive task-dependent compact models from a deep discriminant analysis perspective. We propose an iterative and proactive approach for classification tasks which alternates between (1) a pushing step, with an objective to simultaneously maximize class separation, penalize covariances, and push deep discriminants into alignment with a compact set of neurons, and (2) a pruning step, which discards less useful or even interfering neurons. Deconvolution is adopted to reverse the effects of 'unimportant' filters and recover useful contributing sources. A simple network growing strategy based on the basic Inception module is proposed for challenging tasks requiring larger capacity than what the base net can offer. Experiments on the MNIST, CIFAR10, and ImageNet datasets demonstrate our approach's efficacy. On ImageNet, by pushing and pruning our grown Inception-88 model, we achieve more accurate models than Inception nets generated during growing, residual nets, and popular compact nets at similar sizes. We also show that our grown Inception nets (without hard-coded dimension alignment) clearly outperform residual nets of similar complexities.
LG Mar 18, 2020
Unsupervised Domain Adaptation Through Transferring both the Source-Knowledge and Target-Relatedness Simultaneously
Qing Tian, Yanan Zhu, Chuang Ma et al.
Unsupervised domain adaptation (UDA) is an emerging research topic in the field of machine learning and pattern recognition, which aims to help the learning of unlabeled target domain by transferring knowledge from the source domain.
CV Mar 21, 2018
Task dependent Deep LDA pruning of neural networks
Qing Tian, Tal Arbel, James J. Clark
With deep learning's success, a limited number of popular deep nets have been widely adopted for various vision tasks. However, this usually results in unnecessarily high complexity and possibly many features of low task utility. In this paper, we address this problem by introducing a task-dependent deep pruning framework based on Fisher's Linear Discriminant Analysis (LDA). The approach can be applied to convolutional, fully-connected, and module-based deep network structures, in all cases leveraging the high decorrelation of neuron motifs found in the pre-decision space and cross-layer deconv dependency. Moreover, we examine our approach's potential in network architecture search for specific tasks and analyze the influence of our pruning on model robustness to noise and adversarial attacks. Experimental results on datasets of generic objects (ImageNet, CIFAR100) as well as domain-specific tasks (Adience and LFWA) illustrate our framework's superior performance over state-of-the-art pruning approaches and fixed compact nets (e.g., SqueezeNet, MobileNet). The proposed method successfully maintains comparable accuracies even after discarding most parameters (98%-99% for VGG16, up to 82% for the already compact InceptionNet) and with significant FLOP reductions (83% for VGG16, up to 64% for InceptionNet). Through pruning, we can also derive smaller, but more accurate and more robust models suitable for the task.
CV Apr 20, 2017
Efficient Gender Classification Using a Deep LDA-Pruned Net
Qing Tian, Tal Arbel, James J. Clark
Many real-time tasks, such as human-computer interaction, require fast and efficient facial gender classification. Although deep CNNs have been very effective for a multitude of classification tasks, their high space and time demands make them impractical for personal computers and mobile devices without a powerful GPU. In this paper, we develop a 16-layer, yet lightweight, neural network which boosts efficiency while maintaining high accuracy. Our net is pruned from the VGG-16 model starting from the last convolutional (conv) layer, where we find neuron activations are highly uncorrelated given the gender. Through Fisher's Linear Discriminant Analysis (LDA), we show that this high decorrelation makes it safe to directly discard last conv layer neurons with high within-class variance and low between-class variance. Combined with either Support Vector Machines (SVM) or Bayesian classification, the reduced CNNs are capable of achieving comparable (or even higher) accuracies on the LFW and CelebA datasets than the original net with fully connected layers. On LFW, only four Conv5_3 neurons are able to maintain a comparably high recognition accuracy, which results in a reduction of total network size by a factor of 70 with an 11-fold speedup. Comparisons with a state-of-the-art pruning method as well as two smaller nets in terms of accuracy loss and convolutional layer pruning rate are also provided.
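The selection rule described in this abstract can be illustrated with a per-neuron Fisher criterion: when activations are close to uncorrelated, each neuron can be scored on its own by the ratio of between-class to within-class variance, and low-scoring neurons dropped. The following is a minimal sketch under that decorrelation assumption, not the authors' implementation. The function names, the `eps` stabilizer, and the fixed `keep` count are hypothetical.

```python
import numpy as np

def fisher_scores(acts, labels, eps=1e-12):
    """Per-neuron ratio of between-class to within-class scatter.

    acts:   (n_samples, n_neurons) activation matrix
    labels: (n_samples,) class labels
    Treating each neuron independently is only reasonable when the
    activations are (nearly) uncorrelated, as the abstract assumes.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    overall_mean = acts.mean(axis=0)
    between = np.zeros(acts.shape[1])
    within = np.zeros(acts.shape[1])
    for c in np.unique(labels):
        cls = acts[labels == c]
        cls_mean = cls.mean(axis=0)
        between += len(cls) * (cls_mean - overall_mean) ** 2
        within += ((cls - cls_mean) ** 2).sum(axis=0)
    return between / (within + eps)

def select_neurons(acts, labels, keep):
    """Indices (ascending) of the `keep` neurons with the highest scores."""
    scores = fisher_scores(acts, labels)
    return np.sort(np.argsort(scores)[::-1][:keep])
```

For example, with `acts = [[0.0, 5.0], [0.1, -5.0], [1.0, 5.0], [1.1, -5.0]]` and `labels = [0, 0, 1, 1]`, the first neuron separates the two classes while the second only varies within each class, so `select_neurons(acts, labels, keep=1)` keeps index 0.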
CV Sep 14, 2016
Joint Gender Classification and Age Estimation by Nearly Orthogonalizing Their Semantic Spaces
Qing Tian, Songcan Chen
In human face-based biometrics, gender classification and age estimation are two typical learning tasks. Although a variety of approaches have been proposed to handle them, only a few solve the two tasks jointly. Even these joint methods do not specifically consider the semantic difference between human gender and age, which is intuitively helpful for joint learning, leaving room for further performance improvement. To this end, we first propose a general learning framework for jointly estimating human gender and age that formulates this semantic relationship as a near-orthogonality regularization and incorporates it into the joint learning objective. To evaluate the effectiveness of the proposed framework, we instantiate it with the widely used binary-class SVM for gender classification and two threshold-based ordinal regression methods (i.e., discriminant learning for ordinal regression and support vector ordinal regression) for age estimation, coupling them through the proposed semantic formulation. Moreover, we develop its kernelized nonlinear counterpart by deriving a representer theorem for the joint learning strategy. Finally, through extensive experiments on three aging datasets, FG-NET, Morph Album I, and Morph Album II, we demonstrate the effectiveness and superiority of the proposed joint learning strategy.
CV Sep 13, 2016
A Unified Gender-Aware Age Estimation
Qing Tian, Songcan Chen, Xiaoyang Tan
Human age estimation has attracted increasing research attention due to its wide applicability in areas such as security monitoring and advertisement recommendation. Although a variety of methods have been proposed, most of them focus only on age-specific facial appearance. However, biological research has shown that not only gender but also the difference in aging between males and females inevitably affects age estimation. To our knowledge, only two methods so far have considered the gender factor. The first is a sequential method that first classifies gender and then performs age estimation separately for the predicted male and female groups. Although it improves age estimation performance by accounting for the semantic difference in gender, the risk of accumulating estimation errors is unavoidable. To overcome this drawback, the second method regresses age together with gender by concatenating their labels into a two-dimensional output using Partial Least Squares (PLS). Although this improves age estimation performance, such a concatenation not only is likely to confuse the semantics of gender and age, but also ignores the aging discrepancy between males and females. To overcome these shortcomings, in this paper we propose a unified framework for gender-aware age estimation. The proposed method considers and utilizes not only the semantic relationship between gender and age, but also the aging discrepancy between males and females. Experimental results demonstrate not only the superiority of our method in performance, but also its good interpretability in revealing the aging discrepancy.