CV · May 12, 2022
Knowledge Distillation for Multi-Target Domain Adaptation in Real-Time Person Re-Identification
Félix Remigereau, Djebril Mekhazni, Sajjad Abdoli et al.
Despite the recent success of deep learning architectures, person re-identification (ReID) remains a challenging problem in real-world applications. Several unsupervised single-target domain adaptation (STDA) methods have recently been proposed to limit the decline in ReID accuracy caused by the domain shift that typically occurs between source and target video data. Given the multimodal nature of person ReID data (due to variations across camera viewpoints and capture conditions), training a common CNN backbone to address domain shifts across multiple target domains can provide an efficient solution for real-time ReID applications. Although multi-target domain adaptation (MTDA) has not been widely addressed in the ReID literature, a straightforward approach consists of blending different target datasets and performing STDA on the mixture to train a common CNN. However, this approach may lead to poor generalization, especially when blending a growing number of distinct target domains to train a smaller CNN. To alleviate this problem, we introduce a new MTDA method based on knowledge distillation (KD-ReID) that is suitable for real-time person ReID applications. Our method adapts a common lightweight student backbone CNN over the target domains by alternately distilling from multiple specialized teacher CNNs, each one adapted on data from a specific target domain. Extensive experiments conducted on several challenging person ReID datasets indicate that our approach outperforms state-of-the-art methods for MTDA, including blending methods, particularly when training a compact CNN backbone like OSNet. Results suggest that our flexible MTDA approach can be employed to design cost-effective ReID systems for real-time video surveillance applications.
CV · Mar 7, 2022
Dynamic Template Selection Through Change Detection for Adaptive Siamese Tracking
Madhu Kiran, Le Thanh Nguyen-Meidine, Rajat Sahay et al.
Deep Siamese trackers have gained much attention in recent years since they can track visual objects at high speeds. Additionally, adaptive tracking methods, where target samples collected by the tracker are employed for online learning, have achieved state-of-the-art accuracy. However, single object tracking (SOT) remains a challenging task in real-world applications due to changes and deformations in a target object's appearance. Learning on all the collected samples may lead to catastrophic forgetting, and thereby corrupt the tracking model. In this paper, SOT is formulated as an online incremental learning problem. A new method is proposed for dynamic sample selection and memory replay, preventing template corruption. In particular, we propose a change detection mechanism to detect gradual changes in object appearance and select the corresponding samples for online adaptation. In addition, an entropy-based sample selection strategy is introduced to maintain a diversified auxiliary buffer for memory replay. Our proposed method can be integrated into any object tracking algorithm that leverages online learning for model adaptation. Extensive experiments conducted on the OTB-100, LaSOT, UAV123, and TrackingNet datasets highlight the cost-effectiveness of our method, along with the contribution of its key components. Results indicate that integrating our proposed method into state-of-the-art adaptive Siamese trackers can increase the potential benefits of a template update strategy, and significantly improve performance.
CV · Jul 14, 2020
Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation
Le Thanh Nguyen-Meidine, Atif Belal, Madhu Kiran et al.
Unsupervised domain adaptation (UDA) seeks to alleviate the problem of domain shift between the distribution of unlabeled data from the target domain and that of labeled data from the source domain. While the single-target UDA scenario is well studied in the literature, Multi-Target Domain Adaptation (MTDA) remains largely unexplored despite its practical importance, e.g., in multi-camera video-surveillance applications. The MTDA problem can be addressed by adapting one specialized model per target domain, although this solution is too costly in many real-world applications. Blending multiple targets for MTDA has been proposed, yet this solution may lead to a reduction in model specificity and accuracy. In this paper, we propose a novel unsupervised MTDA approach to train a CNN that can generalize well across multiple target domains. Our Multi-Teacher MTDA (MT-MTDA) method relies on multi-teacher knowledge distillation (KD) to iteratively distill target domain knowledge from multiple teachers to a common student. The KD process is performed in a progressive manner, where the student is trained by each teacher on how to perform UDA for a specific target, instead of directly learning domain adapted features. Finally, instead of combining the knowledge from each teacher, MT-MTDA alternates between teachers that distill knowledge, thereby preserving the specificity of each target (teacher) when learning to adapt to the student. MT-MTDA is compared against state-of-the-art methods on several challenging UDA benchmarks, and empirical results show that our proposed model can provide a considerably higher level of accuracy across multiple target domains. Our code is available at: https://github.com/LIVIAETS/MT-MTDA
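The idea of alternating between teachers, rather than averaging them, can be illustrated with a minimal sketch. This is not the MT-MTDA implementation (see the repository linked above for that); the function names `softmax`, `kd_loss`, and `alternating_schedule`, the round-robin order, and the temperature value are all assumptions chosen for illustration.

```python
import math
from itertools import cycle


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened outputs.

    This is a common generic distillation loss, assumed here for
    illustration; the paper's exact objective may differ.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def alternating_schedule(teacher_names, num_steps):
    """Pick exactly one teacher per training step, in round-robin order,
    instead of fusing all teacher outputs into a single target."""
    order = cycle(teacher_names)
    return [next(order) for _ in range(num_steps)]


if __name__ == "__main__":
    # Hypothetical teacher names, one per target domain.
    for step, teacher in enumerate(alternating_schedule(["target_A", "target_B"], 4)):
        print(step, teacher)
```

In a real training loop, each step would draw a batch from the scheduled teacher's target domain and minimise `kd_loss` between the student and that teacher only.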
CV · Feb 21, 2022
Generative Target Update for Adaptive Siamese Tracking
Madhu Kiran, Le Thanh Nguyen-Meidine, Rajat Sahay et al.
Siamese trackers perform similarity matching with templates (i.e., target models) to recursively localize objects within a search region. Several strategies have been proposed in the literature to update a template based on the tracker output, typically extracted from the target search region in the current frame, and thereby mitigate the effects of target drift. However, this may lead to corrupted templates, limiting the potential benefits of a template update strategy. This paper proposes a model adaptation method for Siamese trackers that uses a generative model to produce a synthetic template from the object search regions of several previous frames, rather than directly using the tracker output. Since the search region encompasses the target, attention from the search region is used for robust model adaptation. In particular, our approach relies on an auto-encoder trained through adversarial learning to detect changes in a target object's appearance and predict a future target template, using a set of target templates localized from tracker outputs at previous frames. To prevent template corruption during the update, the proposed tracker also performs change detection using the generative model to suspend updates until the tracker stabilizes, and robust matching can resume through dynamic template fusion. Extensive experiments conducted on VOT-16, VOT-17, OTB-50, and OTB-100 datasets highlight the effectiveness of our method, along with the impact of its key components. Results indicate that our proposed approach can outperform state-of-the-art trackers, and its overall robustness allows tracking for a longer time before failure.
CV · Apr 13, 2021
Holistic Guidance for Occluded Person Re-Identification
Madhu Kiran, R Gnana Praveen, Le Thanh Nguyen-Meidine et al.
In real-world video surveillance applications, person re-identification (ReID) suffers from the effects of occlusions and detection errors. Despite recent advances, occlusions continue to corrupt the features extracted by state-of-the-art CNN backbones, and thereby deteriorate the accuracy of ReID systems. To address this issue, methods in the literature use an additional costly process such as pose estimation, where pose maps provide supervision to exclude occluded regions. In contrast, we introduce a novel Holistic Guidance (HG) method that relies only on person identity labels, and on the distribution of pairwise matching distances of datasets to alleviate the problem of occlusion, without requiring additional supervision. Hence, our proposed student-teacher framework is trained to address the occlusion problem by matching the distributions of between- and within-class distances (DCDs) of occluded samples with those of holistic (non-occluded) samples, thereby using the latter as a soft labeled reference to learn well separated DCDs. This approach is supported by our empirical study, where the distributions of between- and within-class distances between images overlap more in occluded than in holistic datasets. In particular, features extracted from both datasets are jointly learned using the student model to produce an attention map that allows separating visible regions from occluded ones. In addition, a joint generative-discriminative backbone is trained with a denoising autoencoder, allowing the system to self-recover from occlusions. Extensive experiments on several challenging public datasets indicate that the proposed approach can outperform state-of-the-art methods on both occluded and holistic datasets.
CV · Apr 13, 2021
Incremental Multi-Target Domain Adaptation for Object Detection with Efficient Domain Transfer
Le Thanh Nguyen-Meidine, Madhu Kiran, Marco Pedersoli et al.
Recent advances in unsupervised domain adaptation have significantly improved the recognition accuracy of CNNs by alleviating the domain shift between (labeled) source and (unlabeled) target data distributions. While the problem of single-target domain adaptation (STDA) for object detection has recently received much attention, multi-target domain adaptation (MTDA) remains largely unexplored, despite its practical relevance in several real-world applications, such as multi-camera video surveillance. Compared to the STDA problem that may involve large domain shifts between complex source and target distributions, MTDA faces additional challenges, most notably the computational requirements and catastrophic forgetting of previously learned targets, which can depend on the order of target adaptations. STDA for detection can be applied to MTDA by adapting one model per target, or one common model with a mixture of data from target domains. However, these approaches are either costly or inaccurate. The only state-of-the-art MTDA method specialized for detection learns targets incrementally, one target at a time, and mitigates the loss of knowledge by using a duplicated detection model for knowledge distillation, which is computationally expensive and does not scale well to many domains. In this paper, we introduce an efficient approach for incremental learning that generalizes well to multiple target domains. Our MTDA approach is more suitable for real-world applications since it allows updating the detection model incrementally, without storing data from previously learned target domains or retraining when a new target domain becomes available. Our proposed method, MTDA-DTM, achieved the highest level of detection accuracy compared against state-of-the-art approaches on several MTDA detection benchmarks and Wildtrack, a benchmark for multi-camera pedestrian detection.
CV · Jan 18, 2021
Knowledge Distillation Methods for Efficient Unsupervised Adaptation Across Multiple Domains
Le Thanh Nguyen-Meidine, Atif Belal, Madhu Kiran et al.
Beyond the complexity of CNNs that require training on large annotated datasets, the domain shift between design and operational data has limited the adoption of CNNs in many real-world applications. For instance, in person re-identification, videos are captured over a distributed set of cameras with non-overlapping viewpoints. The shift between the source (e.g. lab setting) and target (e.g. cameras) domains may lead to a significant decline in recognition accuracy. Additionally, state-of-the-art CNNs may not be suitable for such real-time applications given their computational requirements. Although several techniques have recently been proposed to address domain shift problems through unsupervised domain adaptation (UDA), or to accelerate/compress CNNs through knowledge distillation (KD), we seek to simultaneously adapt and compress CNNs to generalize well across multiple target domains. In this paper, we propose a progressive KD approach for unsupervised single-target DA (STDA) and multi-target DA (MTDA) of CNNs. Our method for KD-STDA adapts a CNN to a single target domain by distilling from a larger teacher CNN, trained on both target and source domain data in order to maintain its consistency with a common representation. Our proposed approach is compared against state-of-the-art methods for compression and STDA of CNNs on the Office31 and ImageClef-DA image classification datasets. It is also compared against state-of-the-art methods for MTDA on Digits, Office31, and OfficeHome. In both settings -- KD-STDA and KD-MTDA -- results indicate that our approach can achieve the highest level of accuracy across target domains, while requiring a comparable or lower CNN complexity.
LG · May 16, 2020
Joint Progressive Knowledge Distillation and Unsupervised Domain Adaptation
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran et al.
Currently, the divergence in distributions of design and operational data, and large computational complexity are limiting factors in the adoption of CNNs in real-world applications. For instance, person re-identification systems typically rely on a distributed set of cameras, where each camera has different capture conditions. This can translate to a considerable shift between source (e.g. lab setting) and target (e.g. operational camera) domains. Given the cost of annotating image data captured for fine-tuning in each target domain, unsupervised domain adaptation (UDA) has become a popular approach to adapt CNNs. Moreover, state-of-the-art deep learning models that provide a high level of accuracy often rely on architectures that are too complex for real-time applications. Although several compression and UDA approaches have recently been proposed to overcome these limitations, they do not allow optimizing a CNN to simultaneously address both. In this paper, we propose an unexplored direction -- the joint optimization of CNNs to provide a compressed model that is adapted to perform well for a given target domain. In particular, the proposed approach performs unsupervised knowledge distillation (KD) from a complex teacher model to a compact student model, by leveraging both source and target data. It also improves upon existing UDA techniques by progressively teaching the student about domain-invariant features, instead of directly adapting a compact model on target domain data. Our method is compared against state-of-the-art compression and UDA techniques, using two popular classification datasets for UDA -- Office31 and ImageClef-DA. In both datasets, results indicate that our method can achieve the highest level of accuracy while requiring a comparable or lower time complexity.
CV · Oct 31, 2019
On the Interaction Between Deep Detectors and Siamese Trackers in Video Surveillance
Madhu Kiran, Vivek Tiwari, Le Thanh Nguyen-Meidine et al.
Visual object tracking (VOT) is an important function in many real-time video surveillance applications, such as localization and spatio-temporal recognition of persons. In real-world applications, an object detector and tracker must interact on a periodic basis to discover new objects, and thereby to initiate tracks. Periodic interactions with the detector can also allow the tracker to validate and/or update its object template with new bounding boxes. However, bounding boxes provided by a state-of-the-art detector are noisy, due to changes in appearance, background and occlusion, which can cause the tracker to drift. Moreover, CNN-based detectors can provide a high level of accuracy at the expense of computational complexity, so interactions should be minimized for real-time applications. In this paper, a new approach is proposed to manage detector-tracker interactions for trackers from the Siamese-FC family. By integrating a change detection mechanism into a deep Siamese-FC tracker, its template can be adapted in response to changes in a target's appearance that lead to drifts during tracking. Detecting an abrupt change triggers an update of the tracker template using the bounding box produced by the detector, while in the case of a gradual change, the detector is used to update an evolving set of templates for robust matching. Experiments were performed using state-of-the-art Siamese-FC trackers and the YOLOv3 detector on a subset of videos from the OTB-100 dataset that mimic video surveillance scenarios. Results highlight the importance of accurate detectors for reliable VOT. They also indicate that our adaptive Siamese trackers are robust to noisy object detections, and can significantly improve the performance of Siamese-FC tracking.
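The distinction between abrupt and gradual appearance changes can be made concrete with a small sketch. The abstract does not specify the detection rule, so everything here is an assumption for illustration: the input is taken to be a per-frame template-matching similarity score, an abrupt change is a large drop between consecutive frames, and a gradual change is an accumulated drift from a reference score (a CUSUM-style test). The function name `classify_change` and all thresholds are hypothetical.

```python
def classify_change(scores, abrupt_drop=0.3, drift_limit=0.5, slack=0.05):
    """Label each frame 'none', 'gradual', or 'abrupt'.

    scores      -- per-frame similarity between template and target (assumed input)
    abrupt_drop -- frame-to-frame drop that counts as an abrupt change (hypothetical)
    drift_limit -- accumulated drift that counts as a gradual change (hypothetical)
    slack       -- per-frame tolerance subtracted from the drift (hypothetical)
    """
    labels = []
    reference = scores[0]  # score at the last template update
    drift = 0.0            # accumulated shortfall relative to the reference
    prev = scores[0]
    for score in scores:
        if prev - score >= abrupt_drop:
            # Sudden drop: a tracker might replace the template here.
            labels.append("abrupt")
            reference, drift = score, 0.0
        else:
            drift = max(0.0, drift + (reference - score) - slack)
            if drift >= drift_limit:
                # Slow decline: a tracker might extend its template set here.
                labels.append("gradual")
                reference, drift = score, 0.0
            else:
                labels.append("none")
        prev = score
    return labels


if __name__ == "__main__":
    print(classify_change([0.9, 0.9, 0.5, 0.5]))
    print(classify_change([0.9, 0.8, 0.7, 0.6, 0.5]))
```

Separating the two cases matters because they call for different responses: a single-frame drop suggests the template is no longer valid, while a slow decline suggests the template set should evolve with the target.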
CV · Jul 4, 2019
Exploiting Prunability for Person Re-Identification
Hugo Masson, Amran Bhuiyan, Le Thanh Nguyen-Meidine et al.
Recent years have witnessed a substantial increase in the deep learning (DL) architectures proposed for visual recognition tasks like person re-identification, where individuals must be recognized over multiple distributed cameras. Although these architectures have greatly improved the state-of-the-art accuracy, the computational complexity of the CNNs commonly used for feature extraction remains an issue, hindering their deployment on platforms with limited resources, or in applications with real-time constraints. There is an obvious advantage to accelerating and compressing DL models without significantly decreasing their accuracy. However, the source (pruning) domain differs from operational (target) domains, and the domain shift between image data captured with different non-overlapping camera viewpoints leads to lower recognition accuracy. In this paper, we investigate the prunability of these architectures under different design scenarios. This paper first revisits pruning techniques that are suitable for reducing the computational complexity of deep CNNs applied to person re-identification. Then, these techniques are analysed according to their pruning criteria and strategy, and according to different scenarios for exploiting pruning methods to fine-tune networks to target domains. Experimental results obtained using DL models with ResNet feature extractors, and multiple benchmark re-identification datasets, indicate that pruning can considerably reduce network complexity while maintaining a high level of accuracy. In scenarios where pruning is performed with large pre-training or fine-tuning datasets, the number of FLOPs required by ResNet architectures is reduced by half, while maintaining a comparable rank-1 accuracy (within 1% of the original model). Pruning while training larger CNNs can also provide significantly better performance than fine-tuning smaller ones.
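To give a sense of what halving FLOPs means in terms of filters, here is a back-of-the-envelope sketch. It assumes a plain stride-1, same-padded convolution and uniform pruning of both its input and output channels, which is a simplification and not the procedure evaluated in the paper; the function names and layer sizes are hypothetical.

```python
import math


def conv_flops(height, width, in_channels, out_channels, kernel_size):
    """Approximate FLOPs of one convolution layer, counting each
    multiply-accumulate as two operations (bias ignored)."""
    return 2 * height * width * in_channels * out_channels * kernel_size ** 2


def keep_ratio_for_flops(target_fraction):
    """Fraction of channels to keep when input and output channels are
    pruned by the same ratio r, so that FLOPs scale with r squared."""
    return math.sqrt(target_fraction)


if __name__ == "__main__":
    full = conv_flops(56, 56, 64, 64, 3)
    ratio = keep_ratio_for_flops(0.5)   # about 0.707
    kept = round(64 * ratio)            # 45 of 64 channels
    pruned = conv_flops(56, 56, kept, kept, 3)
    print(kept, pruned / full)
```

Under these assumptions, removing roughly 30% of the filters in each layer is enough to halve the cost of the interior layers, because each pruned filter also removes an input channel of the next layer.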
LG · Jun 20, 2019
Progressive Gradient Pruning for Classification, Detection and Domain Adaptation
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran et al.
Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remain an issue, especially for applications on platforms with limited resources and requiring real-time processing. Filter pruning techniques have recently shown promising results for the compression and acceleration of convolutional NNs (CNNs). However, these techniques involve numerous steps and complex optimisations because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function. In this paper we propose a new Progressive Gradient Pruning (PGP) technique for iterative filter pruning during training. In contrast to previous progressive pruning techniques, it relies on a novel filter selection criterion that measures the change in filter weights, uses a new hard and soft pruning strategy and effectively adapts momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on image data for classification, object detection and domain adaptation benchmarks indicate that the PGP technique can achieve a better trade-off between classification accuracy and network (time and memory) complexity than PSFP and other state-of-the-art filter pruning techniques.
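A selection criterion based on the change in filter weights could look roughly like the sketch below. The abstract does not give the exact formula or say which filters are removed, so this example makes two labeled assumptions: the change is measured as the L1 norm of the difference between a filter's flattened weights at two training steps, and the filters that changed least are the pruning candidates. The function names are hypothetical and this is not the PGP implementation.

```python
def filter_change_scores(weights_before, weights_after):
    """L1 norm of each filter's weight change between two training steps.

    Each argument is a list of filters; each filter is a flat list of floats.
    """
    return [
        sum(abs(a - b) for a, b in zip(after, before))
        for before, after in zip(weights_before, weights_after)
    ]


def select_filters_to_prune(weights_before, weights_after, num_to_prune):
    """Indices of the filters whose weights changed least (assumed criterion)."""
    scores = filter_change_scores(weights_before, weights_after)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_to_prune])


if __name__ == "__main__":
    before = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    after = [[1.5, 0.5], [0.5, 0.5], [0.1, 0.0]]
    print(filter_change_scores(before, after))
    print(select_filters_to_prune(before, after, 2))
```

A criterion of this kind needs only the weights the optimizer already tracks, which fits a scheme that prunes iteratively during training instead of after it.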
CV · Sep 10, 2018
A Comparison of CNN-based Face and Head Detectors for Real-Time Video Surveillance Applications
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran et al.
Detecting faces and heads appearing in video feeds is a challenging task in real-world video surveillance applications due to variations in appearance, occlusions and complex backgrounds. Recently, several CNN architectures have been proposed to increase the accuracy of detectors, although their computational complexity can be an issue, especially for real-time applications, where faces and heads must be detected live using high-resolution cameras. This paper compares the accuracy and complexity of state-of-the-art CNN architectures that are suitable for face and head detection. Single pass and region-based architectures are reviewed and compared empirically to baseline techniques according to accuracy and to time and memory complexity on images from several challenging datasets. The viability of these architectures is analyzed with real-time video surveillance applications in mind. Results suggest that, although CNN architectures can achieve a very high level of accuracy compared to traditional detectors, their computational cost can represent a limitation for many practical real-time applications.