CVNov 3, 2022
Progressive Transformation Learning for Leveraging Virtual Images in TrainingYi-Ting Shen, Hyungtae Lee, Heesung Kwon et al.
To effectively interrogate UAV-based images for detecting objects of interest, such as humans, it is essential to acquire large-scale UAV-based datasets that include human instances with various poses captured from widely varying viewing angles. As a viable alternative to laborious and costly data curation, we introduce Progressive Transformation Learning (PTL), which gradually augments a training dataset by adding transformed virtual images with enhanced realism. Generally, a virtual2real transformation generator in the conditional GAN framework suffers from quality degradation when a large domain gap exists between real and virtual images. To deal with the domain gap, PTL takes a novel approach that progressively iterates the following three steps: 1) select a subset from a pool of virtual images according to the domain gap, 2) transform the selected virtual images to enhance realism, and 3) add the transformed virtual images to the training set while removing them from the pool. In PTL, accurately quantifying the domain gap is critical. To do that, we theoretically demonstrate that the feature representation space of a given object detector can be modeled as a multivariate Gaussian distribution from which the Mahalanobis distance between a virtual object and the Gaussian distribution of each object category in the representation space can be readily computed. Experiments show that PTL results in a substantial performance increase over the baseline, especially in the small data and the cross-domain regime.
CVApr 7, 2022
Exploring Cross-Domain Pretrained Model for Hyperspectral Image ClassificationHyungtae Lee, Sungmin Eum, Heesung Kwon
A pretrain-finetune strategy is widely used to reduce the overfitting that can occur when data is insufficient for CNN training. First few layers of a CNN pretrained on a large-scale RGB dataset are capable of acquiring general image characteristics which are remarkably effective in tasks targeted for different RGB datasets. However, when it comes down to hyperspectral domain where each domain has its unique spectral properties, the pretrain-finetune strategy no longer can be deployed in a conventional way while presenting three major issues: 1) inconsistent spectral characteristics among the domains (e.g., frequency range), 2) inconsistent number of data channels among the domains, and 3) absence of large-scale hyperspectral dataset. We seek to train a universal cross-domain model which can later be deployed for various spectral domains. To achieve, we physically furnish multiple inlets to the model while having a universal portion which is designed to handle the inconsistent spectral characteristics among different domains. Note that only the universal portion is used in the finetune process. This approach naturally enables the learning of our model on multiple domains simultaneously which acts as an effective workaround for the issue of the absence of large-scale dataset. We have carried out a study to extensively compare models that were trained using cross-domain approach with ones trained from scratch. Our approach was found to be superior both in accuracy and in training efficiency. In addition, we have verified that our approach effectively reduces the overfitting issue, enabling us to deepen the model up to 13 layers (from 9) without compromising the accuracy.
CVApr 6, 2022
DBF: Dynamic Belief Fusion for Combining Multiple Object DetectorsHyungtae Lee, Heesung Kwon
In this paper, we propose a novel and highly practical score-level fusion approach called dynamic belief fusion (DBF) that directly integrates inference scores of individual detections from multiple object detection methods. To effectively integrate the individual outputs of multiple detectors, the level of ambiguity in each detection score is estimated using a confidence model built on a precision-recall relationship of the corresponding detector. For each detector output, DBF then calculates the probabilities of three hypotheses (target, non-target, and intermediate state (target or non-target)) based on the confidence level of the detection score conditioned on the prior confidence model of individual detectors, which is referred to as basic probability assignment. The probability distributions over three hypotheses of all the detectors are optimally fused via the Dempster's combination rule. Experiments on the ARL, PASCAL VOC 07, and 12 datasets show that the detection accuracy of the DBF is significantly higher than any of the baseline fusion approaches as well as individual detectors used for the fusion.
CVApr 14, 2023
NEV-NCD: Negative Learning, Entropy, and Variance regularization based novel action categories discoveryZahid Hasan, Masud Ahmed, Abu Zaher Md Faridee et al.
Novel Categories Discovery (NCD) facilitates learning from a partially annotated label space and enables deep learning (DL) models to operate in an open-world setting by identifying and differentiating instances of novel classes based on the labeled data notions. One of the primary assumptions of NCD is that the novel label space is perfectly disjoint and can be equipartitioned, but it is rarely realized by most NCD approaches in practice. To better align with this assumption, we propose a novel single-stage joint optimization-based NCD method, Negative learning, Entropy, and Variance regularization NCD (NEV-NCD). We demonstrate the efficacy of NEV-NCD in previously unexplored NCD applications of video action recognition (VAR) with the public UCF101 dataset and a curated in-house partial action-space annotated multi-view video dataset. We perform a thorough ablation study by varying the composition of final joint loss and associated hyper-parameters. During our experiments with UCF101 and multi-view action dataset, NEV-NCD achieves ~ 83% classification accuracy in test instances of labeled data. NEV-NCD achieves ~ 70% clustering accuracy over unlabeled data outperforming both naive baselines (by ~ 40%) and state-of-the-art pseudo-labeling-based approaches (by ~ 3.5%) over both datasets. Further, we propose to incorporate optional view-invariant feature learning with the multiview dataset to identify novel categories from novel viewpoints. Our additional view-invariance constraint improves the discriminative accuracy for both known and unknown categories by ~ 10% for novel viewpoints.
CVJul 20, 2022
Negative Samples are at Large: Leveraging Hard-distance Elastic Loss for Re-identificationHyungtae Lee, Sungmin Eum, Heesung Kwon
We present a Momentum Re-identification (MoReID) framework that can leverage a very large number of negative samples in training for general re-identification task. The design of this framework is inspired by Momentum Contrast (MoCo), which uses a dictionary to store current and past batches to build a large set of encoded samples. As we find it less effective to use past positive samples which may be highly inconsistent to the encoded feature property formed with the current positive samples, MoReID is designed to use only a large number of negative samples stored in the dictionary. However, if we train the model using the widely used Triplet loss that uses only one sample to represent a set of positive/negative samples, it is hard to effectively leverage the enlarged set of negative samples acquired by the MoReID framework. To maximize the advantage of using the scaled-up negative sample set, we newly introduce Hard-distance Elastic loss (HE loss), which is capable of using more than one hard sample to represent a large number of samples. Our experiments demonstrate that a large number of negative samples provided by MoReID framework can be utilized at full capacity only with the HE loss, achieving the state-of-the-art accuracy on three re-ID benchmarks, VeRi-776, Market-1501, and VeRi-Wild.
CVOct 25, 2023
UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based PerceptionChristopher Maxey, Jaehoon Choi, Hyungtae Lee et al.
Tremendous variations coupled with large degrees of freedom in UAV-based imaging conditions lead to a significant lack of data in adequately learning UAV-based perception models. Using various synthetic renderers in conjunction with perception models is prevalent to create synthetic data to augment the learning in the ground-based imaging domain. However, severe challenges in the austere UAV-based domain require distinctive solutions to image synthesis for data augmentation. In this work, we leverage recent advancements in neural rendering to improve static and dynamic novelview UAV-based image synthesis, especially from high altitudes, capturing salient scene attributes. Finally, we demonstrate a considerable performance boost is achieved when a state-ofthe-art detection model is optimized primarily on hybrid sets of real and synthetic data instead of the real or synthetic data separately.
CVJul 7, 2023
Novel Categories Discovery Via Constraints on Empirical Prediction StatisticsZahid Hasan, Abu Zaher Md Faridee, Masud Ahmed et al.
Novel Categories Discovery (NCD) aims to cluster novel data based on the class semantics of known classes using the open-world partial class space annotated dataset. As an alternative to the traditional pseudo-labeling-based approaches, we leverage the connection between the data sampling and the provided multinoulli (categorical) distribution of novel classes. We introduce constraints on individual and collective statistics of predicted novel class probabilities to implicitly achieve semantic-based clustering. More specifically, we align the class neuron activation distributions under Monte-Carlo sampling of novel classes in large batches by matching their empirical first-order (mean) and second-order (covariance) statistics with the multinoulli distribution of the labels while applying instance information constraints and prediction consistency under label-preserving augmentations. We then explore a directional statistics-based probability formation that learns the mixture of Von Mises-Fisher distribution of class labels in a unit hypersphere. We demonstrate the discriminative ability of our approach to realize semantic clustering of novel samples in image, video, and time-series modalities. We perform extensive ablation studies regarding data, networks, and framework components to provide better insights. Our approach maintains 94%, 93%, 85%, and 93% (approx.) classification accuracy in labeled data while achieving 90%, 84%, 72%, and 75% (approx.) clustering accuracy for novel categories in Cifar10, UCF101, MPSC-ARL, and SHAR datasets that match state-of-the-art approaches without any external clustering.
CVAug 26, 2024
Exploring the Potential of Synthetic Data to Replace Real DataHyungtae Lee, Yan Zhang, Heesung Kwon et al.
The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\text{AP}_\text{t2t}$, to evaluate the ability of a cross-domain training set using synthetic data to represent the characteristics of test instances in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will encourage more widespread use of synthetic data.
CVAug 21, 2024
SynPlay: Large-Scale Synthetic Human Data with Real-World Diversity for Aerial-View PerceptionJinsub Yim, Hyungtae Lee, Sungmin Eum et al.
We introduce SynPlay, a large-scale synthetic human dataset purpose-built for advancing multi-perspective human localization, with a predominant focus on aerial-view perception. SynPlay departs from traditional synthetic datasets by addressing a critical but underexplored challenge: localizing humans in aerial scenes where subjects often occupy only tens of pixels in the image. In such scenarios, fine-grained details like facial features or textures become irrelevant, shifting the burden of recognition to human motion, behavior, and interactions. To meet this need, SynPlay implements a novel rule-guided motion generation framework that combines real-world motion capture with motion evolution graphs. This design enables human actions to evolve dynamically through high-level game rules rather than predefined scripts, resulting in effectively uncountable motion variations. Unlike existing synthetic datasets-which either focus on static visual traits or reuse a limited set of mocap-driven actions-SynPlay captures a wide spectrum of spontaneous behaviors, including complex interactions that naturally emerge from unscripted gameplay scenarios. SynPlay also introduces an extensive multi-camera setup that spans UAVs at random altitudes, CCTVs, and a freely roaming UGV, achieving true near-to-far perspective coverage in a single dataset. The majority of instances are captured from aerial viewpoints at varying scales, directly supporting the development of models for long-range human analysis-a setting where existing datasets fall short. Our data contains over 73k images and 6.5M human instances, with detailed annotations for detection, segmentation, and keypoint tasks. Extensive experiments demonstrate that training with SynPlay significantly improves human localization performance, especially in few-shot and data-scarce scenarios.
CVOct 11, 2024
MeshGS: Adaptive Mesh-Aligned Gaussian Splatting for High-Quality RenderingJaehoon Choi, Yonghan Lee, Hyungtae Lee et al.
Recently, 3D Gaussian splatting has gained attention for its capability to generate high-fidelity rendering results. At the same time, most applications such as games, animation, and AR/VR use mesh-based representations to represent and render 3D scenes. We propose a novel approach that integrates mesh representation with 3D Gaussian splats to perform high-quality rendering of reconstructed real-world scenes. In particular, we introduce a distance-based Gaussian splatting technique to align the Gaussian splats with the mesh surface and remove redundant Gaussian splats that do not contribute to the rendering. We consider the distance between each Gaussian splat and the mesh surface to distinguish between tightly-bound and loosely-bound Gaussian splats. The tightly-bound splats are flattened and aligned well with the mesh geometry. The loosely-bound Gaussian splats are used to account for the artifacts in reconstructed 3D meshes in terms of rendering. We present a training strategy of binding Gaussian splats to the mesh geometry, and take into account both types of splats. In this context, we introduce several regularization techniques aimed at precisely aligning tightly-bound Gaussian splats with the mesh surface during the training process. We validate the effectiveness of our method on large and unbounded scene from mip-NeRF 360 and Deep Blending datasets. Our method surpasses recent mesh-based neural rendering techniques by achieving a 2dB higher PSNR, and outperforms mesh-based Gaussian splatting methods by 1.3 dB PSNR, particularly on the outdoor mip-NeRF 360 dataset, demonstrating better rendering quality. We provide analyses for each type of Gaussian splat and achieve a reduction in the number of Gaussian splats by 30% compared to the original 3D Gaussian splatting.
CVMay 24, 2024
Diversifying Human Pose in Synthetic Data for Aerial-view Human DetectionYi-Ting Shen, Hyungtae Lee, Heesung Kwon et al.
Synthetic data generation has emerged as a promising solution to the data scarcity issue in aerial-view human detection. However, creating datasets that accurately reflect varying real-world human appearances, particularly diverse poses, remains challenging and labor-intensive. To address this, we propose SynPoseDiv, a novel framework that diversifies human poses within existing synthetic datasets. SynPoseDiv tackles two key challenges: generating realistic, diverse 3D human poses using a diffusion-based pose generator, and producing images of virtual characters in novel poses through a source-to-target image translator. The framework incrementally transitions characters into new poses using optimized pose sequences identified via Dijkstra's algorithm. Experiments demonstrate that SynPoseDiv significantly improves detection accuracy across multiple aerial-view human detection benchmarks, especially in low-shot scenarios, and remains effective regardless of the training approach or dataset size.
CVMay 4, 2024
TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based ScenesChristopher Maxey, Jaehoon Choi, Yonghan Lee et al.
In this paper, we present a new approach to bridge the domain gap between synthetic and real-world data for unmanned aerial vehicle (UAV)-based perception. Our formulation is designed for dynamic scenes, consisting of small moving objects or human actions. We propose an extension of K-Planes Neural Radiance Field (NeRF), wherein our algorithm stores a set of tiered feature vectors. The tiered feature vectors are generated to effectively model conceptual information about a scene as well as an image decoder that transforms output feature maps into RGB images. Our technique leverages the information amongst both static and dynamic objects within a scene and is able to capture salient scene attributes of high altitude videos. We evaluate its performance on challenging datasets, including Okutama Action and UG2, and observe considerable improvement in accuracy over state of the art neural rendering methods.
CVFeb 8, 2022
Self-supervised Contrastive Learning for Cross-domain Hyperspectral Image RepresentationHyungtae Lee, Heesung Kwon
Recently, self-supervised learning has attracted attention due to its remarkable ability to acquire meaningful representations for classification tasks without using semantic labels. This paper introduces a self-supervised learning framework suitable for hyperspectral images that are inherently challenging to annotate. The proposed framework architecture leverages cross-domain CNN, allowing for learning representations from different hyperspectral images with varying spectral characteristics and no pixel-level annotation. In the framework, cross-domain representations are learned via contrastive learning where neighboring spectral vectors in the same image are clustered together in a common representation space encompassing multiple hyperspectral images. In contrast, spectral vectors in different hyperspectral images are separated into distinct clusters in the space. To verify that the learned representation through contrastive learning is effectively transferred into a downstream task, we perform a classification task on hyperspectral images. The experimental results demonstrate the advantage of the proposed self-supervised representation over models trained from scratch or other transfer learning methods.
CVFeb 11, 2019
S-DOD-CNN: Doubly Injecting Spatially-Preserved Object Information for Event RecognitionHyungtae Lee, Sungmin Eum, Heesung Kwon
We present a novel event recognition approach called Spatially-preserved Doubly-injected Object Detection CNN (S-DOD-CNN), which incorporates the spatially preserved object detection information in both a direct and an indirect way. Indirect injection is carried out by simply sharing the weights between the object detection modules and the event recognition module. Meanwhile, our novelty lies in the fact that we have preserved the spatial information for the direct injection. Once multiple regions-of-intereset (RoIs) are acquired, their feature maps are computed and then projected onto a spatially-preserving combined feature map using one of the four RoI Projection approaches we present. In our architecture, combined feature maps are generated for object detection which are directly injected to the event recognition module. Our method provides the state-of-the-art accuracy for malicious event recognition.
CVJan 24, 2019
Is Pretraining Necessary for Hyperspectral Image Classification?Hyungtae Lee, Sungmin Eum, Heesung Kwon
We address two questions for training a convolutional neural network (CNN) for hyperspectral image classification: i) is it possible to build a pre-trained network? and ii) is the pre-training effective in furthering the performance? To answer the first question, we have devised an approach that pre-trains a network on multiple source datasets that differ in their hyperspectral characteristics and fine-tunes on a target dataset. This approach effectively resolves the architectural issue that arises when transferring meaningful information between the source and the target networks. To answer the second question, we carried out several ablation experiments. Based on the experimental results, a network trained from scratch performs as good as a network fine-tuned from a pre-trained network. However, we observed that pre-training the network has its own advantage in achieving better performances when deeper networks are required.
CVDec 13, 2018
Generating Hard Examples for Pixel-wise ClassificationHyungtae Lee, Heesung Kwon, Wonkook Kim
Pixel-wise classification in remote sensing identifies entities in large-scale satellite-based images at the pixel level. Few fully annotated large-scale datasets for pixel-wise classification exist due to the challenges of annotating individual pixels. Training data scarcity inevitably ensues from the annotation challenge, leading to overfitting classifiers and degraded classification performance. The lack of annotated pixels also necessarily results in few hard examples of various entities critical for generating a robust classification hyperplane. To overcome the problem of the data scarcity and lack of hard examples in training, we introduce a two-step hard example generation (HEG) approach that first generates hard example candidates and then mines actual hard examples. In the first step, a generator that creates hard example candidates is learned via the adversarial learning framework by fooling a discriminator and a pixel-wise classification model at the same time. In the second step, mining is performed to build a fixed number of hard examples from a large pool of real and artificially generated examples. To evaluate the effectiveness of the proposed HEG approach, we design a 9-layer fully convolutional network suitable for pixel-wise classification. Experiments show that using generated hard examples from the proposed HEG approach improves the pixel-wise classification model's accuracy on red tide detection and hyperspectral image classification tasks.
CVNov 7, 2018
DOD-CNN: Doubly-injecting Object Information for Event RecognitionHyungtae Lee, Sungmin Eum, Heesung Kwon
Recognizing an event in an image can be enhanced by detecting relevant objects in two ways: 1) indirectly utilizing object detection information within the unified architecture or 2) directly making use of the object detection output results. We introduce a novel approach, referred to as Doubly-injected Object Detection CNN (DOD-CNN), exploiting the object information in both ways for the task of event recognition. The structure of this network is inspired by the Integrated Object Detection CNN (IOD-CNN) where object information is indirectly exploited by the event recognition module through the shared portion of the network. In the DOD-CNN architecture, the intermediate object detection outputs are directly injected into the event recognition network while keeping the indirect sharing structure inherited from the IOD-CNN, thus being `doubly-injected'. We also introduce a batch pooling layer which constructs one representative feature map from multiple object hypotheses. We have demonstrated the effectiveness of injecting the object detection information in two different ways in the task of malicious event recognition.
CVJan 31, 2018
Cross-domain CNN for Hyperspectral Image ClassificationHyungtae Lee, Sungmin Eum, Heesung Kwon
In this paper, we address the dataset scarcity issue with the hyperspectral image classification. As only a few thousands of pixels are available for training, it is difficult to effectively learn high-capacity Convolutional Neural Networks (CNNs). To cope with this problem, we propose a novel cross-domain CNN containing the shared parameters which can co-learn across multiple hyperspectral datasets. The network also contains the non-shared portions designed to handle the dataset specific spectral characteristics and the associated classification tasks. Our approach is the first attempt to learn a CNN for multiple hyperspectral datasets, in an end-to-end fashion. Moreover, we have experimentally shown that the proposed network trained on three of the widely used datasets outperform all the baseline networks which are trained on single dataset.
CVApr 4, 2017
ME R-CNN: Multi-Expert R-CNN for Object DetectionHyungtae Lee, Sungmin Eum, Heesung Kwon
We introduce Multi-Expert Region-based Convolutional Neural Network (ME R-CNN) which is equipped with multiple experts (ME) where each expert is learned to process a certain type of regions of interest (RoIs). This architecture better captures the appearance variations of the RoIs caused by different shapes, poses, and viewing angles. In order to direct each RoI to the appropriate expert, we devise a novel "learnable" network, which we call, expert assignment network (EAN). EAN automatically learns the optimal RoI-expert relationship even without any supervision of expert assignment. As the major components of ME R-CNN, ME and EAN, are mutually affecting each other while tied to a shared network, neither an alternating nor a naive end-to-end optimization is likely to fail. To address this problem, we introduce a practical training strategy which is tailored to optimize ME, EAN, and the shared network in an end-to-end fashion. We show that both of the architectures provide considerable performance increase over the baselines on PASCAL VOC 07, 12, and MS COCO datasets.
CVMar 21, 2017
IOD-CNN: Integrating Object Detection Networks for Event RecognitionSungmin Eum, Hyungtae Lee, Heesung Kwon et al.
Many previous methods have showed the importance of considering semantically relevant objects for performing event recognition, yet none of the methods have exploited the power of deep convolutional neural networks to directly integrate relevant object information into a unified network. We present a novel unified deep CNN architecture which integrates architecturally different, yet semantically-related object detection networks to enhance the performance of the event recognition task. Our architecture allows the sharing of the convolutional layers and a fully connected layer which effectively integrates event recognition, rigid object detection and non-rigid object detection.
CVOct 21, 2016
Enhanced Object Detection via Fusion With Prior Beliefs from Image ClassificationYilun Cao, Hyungtae Lee, Heesung Kwon
In this paper, we introduce a novel fusion method that can enhance object detection performance by fusing decisions from two different types of computer vision tasks: object detection and image classification. In the proposed work, the class label of an image obtained from image classification is viewed as prior knowledge about existence or non-existence of certain objects. The prior knowledge is then fused with the decisions of object detection to improve detection accuracy by mitigating false positives of an object detector that are strongly contradicted with the prior knowledge. A recently introduced novel fusion approach called dynamic belief fusion (DBF) is used to fuse the detector output with the classification prior. Experimental results show that the detection performance of all the detection algorithms used in the proposed work is improved on benchmark datasets via the proposed fusion framework.
CVOct 21, 2016
Exploitation of Semantic Keywords for Malicious Event ClassificationHyungtae Lee, Sungmin Eum, Joel Levis et al.
Learning an event classifier is challenging when the scenes are semantically different but visually similar. However, as humans, we typically handle such tasks painlessly by adding our background semantic knowledge. Motivated by this observation, we aim to provide an empirical study about how additional information such as semantic keywords can boost up the discrimination of such events. To demonstrate the validity of this study, we first construct a novel Malicious Crowd Dataset containing crowd images with two events, benign and malicious, which look visually similar. Note that the primary focus of this paper is not to provide the state-of-the-art performance on this dataset but to show the beneficial aspects of using semantically-driven keyword information. By leveraging crowd-sourcing platforms, such as Amazon Mechanical Turk, we collect semantic keywords associated with images and then subsequently identify a subset of keywords (e.g. police, fire, etc.) unique to specific events. We first show that by using recently introduced attention models, a naive CNN-based event classifier actually learns to primarily focus on local attributes associated with the discriminant semantic keywords identified by the Turks. We further show that incorporating the keyword-driven information into early- and late-fusion approaches can significantly enhance malicious event classification.
CVApr 12, 2016
Going Deeper with Contextual CNN for Hyperspectral Image ClassificationHyungtae Lee, Heesung Kwon
In this paper, we describe a novel deep convolutional neural network (CNN) that is deeper and wider than other existing deep networks for hyperspectral image classification. Unlike current state-of-the-art approaches in CNN-based hyperspectral image classification, the proposed network, called contextual deep CNN, can optimally explore local contextual interactions by jointly exploiting local spatio-spectral relationships of neighboring individual pixel vectors. The joint exploitation of the spatio-spectral information is achieved by a multi-scale convolutional filter bank used as an initial component of the proposed CNN pipeline. The initial spatial and spectral feature maps obtained from the multi-scale filter bank are then combined together to form a joint spatio-spectral feature map. The joint feature map representing rich spectral and spatial properties of the hyperspectral image is then fed through a fully convolutional network that eventually predicts the corresponding label of each pixel vector. The proposed approach is tested on three benchmark datasets: the Indian Pines dataset, the Salinas dataset and the University of Pavia dataset. Performance comparison shows enhanced classification performance of the proposed approach over the current state-of-the-art on the three datasets.
CVApr 12, 2016
DTM: Deformable Template MatchingHyungtae Lee, Heesung Kwon, Ryan M. Robinson et al.
A novel template matching algorithm that can incorporate the concept of deformable parts, is presented in this paper. Unlike the deformable part model (DPM) employed in object recognition, the proposed template-matching approach called Deformable Template Matching (DTM) does not require a training step. Instead, deformation is achieved by a set of predefined basic rules (e.g. the left sub-patch cannot pass across the right patch). Experimental evaluation of this new method using the PASCAL VOC 07 dataset demonstrated substantial performance improvement over conventional template matching algorithms. Additionally, to confirm the applicability of DTM, the concept is applied to the generation of a rotation-invariant SIFT descriptor. Experimental evaluation employing deformable matching of SIFT features shows an increased number of matching features compared to a conventional SIFT matching.
CVApr 12, 2016
Fast Object Localization Using a CNN Feature Map Based Multi-Scale SearchHyungtae Lee, Heesung Kwon, Archith J. Bency et al.
Object localization is an important task in computer vision but requires a large amount of computational power due mainly to an exhaustive multiscale search on the input image. In this paper, we describe a near real-time multiscale search on a deep CNN feature map that does not use region proposals. The proposed approach effectively exploits local semantic information preserved in the feature map of the outermost convolutional layer. A multi-scale search is performed on the feature map by processing all the sub-regions of different sizes using separate expert units of fully connected layers. Each expert unit receives as input local semantic features only from the corresponding sub-regions of a specific geometric shape. Therefore, it contains more nearly optimal parameters tailored to the corresponding shape. This multi-scale and multi-aspect ratio scanning strategy can effectively localize a potential object of an arbitrary size. The proposed approach is fast and able to localize objects of interest with a frame rate of 4 fps while providing improved detection performance over the state-of-the art on the PASCAL VOC 12 and MSCOCO data sets.
CVMar 1, 2016
Weakly Supervised Localization using Deep Feature MapsArchith J. Bency, Heesung Kwon, Hyungtae Lee et al.
Object localization is an important computer vision problem with a variety of applications. The lack of large scale object-level annotations and the relative abundance of image-level labels makes a compelling case for weak supervision in the object localization task. Deep Convolutional Neural Networks are a class of state-of-the-art methods for the related problem of object recognition. In this paper, we describe a novel object localization algorithm which uses classification networks trained on only image labels. This weakly supervised method leverages local spatial and semantic patterns captured in the convolutional layers of classification networks. We propose an efficient beam search based approach to detect and localize multiple objects in images. The proposed method significantly outperforms the state-of-the-art in standard object localization data-sets with a 8 point increase in mAP scores.
CVNov 10, 2015
Dynamic Belief Fusion for Object DetectionHyungtae Lee, Heesung Kwon, Ryan M. Robinson et al.
A novel approach for the fusion of heterogeneous object detection methods is proposed. In order to effectively integrate the outputs of multiple detectors, the level of ambiguity in each individual detection score is estimated using the precision/recall relationship of the corresponding detector. The main contribution of the proposed work is a novel fusion method, called Dynamic Belief Fusion (DBF), which dynamically assigns probabilities to hypotheses (target, non-target, intermediate state (target or non-target)) based on confidence levels in the detection results conditioned on the prior performance of individual detectors. In DBF, a joint basic probability assignment, optimally fusing information from all detectors, is determined by the Dempster's combination rule, and is easily reduced to a single fused detection score. Experiments on ARL and PASCAL VOC 07 datasets demonstrate that the detection accuracy of DBF is considerably greater than conventional fusion approaches as well as individual detectors used for the fusion.