Jaeyoung Yoo

CV
h-index15
9papers
193citations
Novelty56%
AI Score31

9 Papers

CVMar 29, 2022Code
MatteFormer: Transformer-Based Image Matting via Prior-Tokens

GyuTae Park, SungJoon Son, JaeYoung Yoo et al.

In this paper, we propose a transformer-based image matting model called MatteFormer, which takes full advantage of trimap information in the transformer block. Our method first introduces a prior-token which is a global representation of each trimap region (e.g. foreground, background and unknown). These prior-tokens are used as global priors and participate in the self-attention mechanism of each block. Each stage of the encoder is composed of PAST (Prior-Attentive Swin Transformer) block, which is based on the Swin Transformer block, but differs in a couple of aspects: 1) It has PA-WSA (Prior-Attentive Window Self-Attention) layer, performing self-attention not only with spatial-tokens but also with prior-tokens. 2) It has prior-memory which saves prior-tokens accumulatively from the previous blocks and transfers them to the next block. We evaluate our MatteFormer on the commonly used image matting datasets: Composition-1k and Distinctions-646. Experiment results show that our proposed method achieves state-of-the-art performance with a large margin. Our codes are available at https://github.com/webtoon/matteformer.

CVFeb 17, 2023
MDPose: Real-Time Multi-Person Pose Estimation via Mixture Density Model

Seunghyeon Seo, Jaeyoung Yoo, Jihye Hwang et al.

One of the major challenges in multi-person pose estimation is instance-aware keypoint estimation. Previous methods address this problem by leveraging an off-the-shelf detector, heuristic post-grouping process or explicit instance identification process, hindering further improvements in the inference speed which is an important factor for practical applications. From the statistical point of view, those additional processes for identifying instances are necessary to bypass learning the high-dimensional joint distribution of human keypoints, which is a critical factor for another major challenge, the occlusion scenario. In this work, we propose a novel framework of single-stage instance-aware pose estimation by modeling the joint distribution of human keypoints with a mixture density model, termed as MDPose. Our MDPose estimates the distribution of human keypoints' coordinates using a mixture density model with an instance-aware keypoint head consisting simply of 8 convolutional layers. It is trained by minimizing the negative log-likelihood of the ground truth keypoints. Also, we propose a simple yet effective training strategy, Random Keypoint Grouping (RKG), which significantly alleviates the underflow problem leading to successful learning of relations between keypoints. On OCHuman dataset, which consists of images with highly occluded people, our MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints. Furthermore, our MDPose shows significant improvement in inference speed with a competitive accuracy on MS COCO, a widely-used human keypoint dataset, thanks to the proposed much simpler single-stage pipeline.

CVMay 18, 2022
End-to-End Multi-Object Detection with a Regularized Mixture Model

Jaeyoung Yoo, Hojun Lee, Seunghyeon Seo et al.

Recent end-to-end multi-object detectors simplify the inference pipeline by removing hand-crafted processes such as non-maximum suppression (NMS). However, during training, they still heavily rely on heuristics and hand-crafted processes which deteriorate the reliability of the predicted confidence score. In this paper, we propose a novel framework to train an end-to-end multi-object detector consisting of only two terms: negative log-likelihood (NLL) and a regularization term. In doing so, the multi-object detection problem is treated as density estimation of the ground truth bounding boxes utilizing a regularized mixture density model. The proposed \textit{end-to-end multi-object Detection with a Regularized Mixture Model} (D-RMM) is trained by minimizing the NLL with the proposed regularization term, maximum component maximization (MCM) loss, preventing duplicate predictions. Our method reduces the heuristics of the training process and improves the reliability of the predicted confidence score. Moreover, our D-RMM outperforms the previous end-to-end detectors on MS COCO dataset.

CVNov 28, 2019Code
Training Multi-Object Detector by Estimating Bounding Box Distribution for Input Image

Jaeyoung Yoo, Hojun Lee, Inseop Chung et al.

In multi-object detection using neural networks, the fundamental problem is, "How should the network learn a variable number of bounding boxes in different input images?". Previous methods train a multi-object detection network through a procedure that directly assigns the ground truth bounding boxes to the specific locations of the network's output. However, this procedure makes the training of a multi-object detection network too heuristic and complicated. In this paper, we reformulate the multi-object detection task as a problem of density estimation of bounding boxes. Instead of assigning each ground truth to specific locations of network's output, we train a network by estimating the probability density of bounding boxes in an input image using a mixture model. For this purpose, we propose a novel network for object detection called Mixture Density Object Detector (MDOD), and the corresponding objective function for the density-estimation-based training. We applied MDOD to MS COCO dataset. Our proposed method not only deals with multi-object detection problems in a new approach, but also improves detection performances through MDOD. The code is available: https://github.com/yoojy31/MDOD.

CVApr 14, 2024
Coreset Selection for Object Detection

Hojun Lee, Suyoung Kim, Junhoo Lee et al.

Coreset selection is a method for selecting a small, representative subset of an entire dataset. It has been primarily researched in image classification, assuming there is only one object per image. However, coreset selection for object detection is more challenging as an image can contain multiple objects. As a result, much research has yet to be done on this topic. Therefore, we introduce a new approach, Coreset Selection for Object Detection (CSOD). CSOD generates imagewise and classwise representative feature vectors for multiple objects of the same class within each image. Subsequently, we adopt submodular optimization for considering both representativeness and diversity and utilize the representative vectors in the submodular optimization process to select a subset. When we evaluated CSOD on the Pascal VOC dataset, CSOD outperformed random selection by +6.4%p in AP$_{50}$ when selecting 200 images.

CVMay 22, 2020
KL-Divergence-Based Region Proposal Network for Object Detection

Geonseok Seo, Jaeyoung Yoo, Jaeseok Choi et al.

The learning of the region proposal in object detection using the deep neural networks (DNN) is divided into two tasks: binary classification and bounding box regression task. However, traditional RPN (Region Proposal Network) defines these two tasks as different problems, and they are trained independently. In this paper, we propose a new region proposal learning method that considers the bounding box offset's uncertainty in the objectness score. Our method redefines RPN to a problem of minimizing the KL-divergence, difference between the two probability distributions. We applied KL-RPN, which performs region proposal using KL-Divergence, to the existing two-stage object detection framework and showed that it can improve the performance of the existing method. Experiments show that it achieves 2.6% and 2.0% AP improvements on MS COCO test-dev in Faster R-CNN with VGG-16 and R-FCN with ResNet-101 backbone, respectively.

LGFeb 12, 2019
Unpriortized Autoencoder For Image Generation

Jaeyoung Yoo, Hojun Lee, Nojun Kwak

In this paper, we treat the image generation task using an autoencoder, a representative latent model. Unlike many studies regularizing the latent variable's distribution by assuming a manually specified prior, we approach the image generation task using an autoencoder by directly estimating the latent distribution. To this end, we introduce 'latent density estimator' which captures latent distribution explicitly and propose its structure. Through experiments, we show that our generative model generates images with the improved visual quality compared to previous autoencoder-based generative models.

CVJul 9, 2018
Vehicle Image Generation Going Well with The Surroundings

Jeesoo Kim, Jangho Kim, Jaeyoung Yoo et al.

Since the generative neural networks have made a breakthrough in the image generation problem, lots of researches on their applications have been studied such as image restoration, style transfer and image completion. However, there has been few research generating objects in uncontrolled real-world environments. In this paper, we propose a novel approach for vehicle image generation in real-world scenes. Using a subnetwork based on a precedent work of image completion, our model makes the shape of an object. Details of objects are trained by an additional colorization and refinement subnetwork, resulting in a better quality of generated objects. Unlike many other works, our method does not require any segmentation layout but still makes a plausible vehicle in the image. We evaluate our method by using images from Berkeley Deep Drive (BDD) and Cityscape datasets, which are widely used for object detection and image segmentation problems. The adequacy of the generated images by the proposed method has also been evaluated using a widely utilized object detection algorithm and the FID score.

CVMay 23, 2018
Image Restoration by Estimating Frequency Distribution of Local Patches

Jaeyoung Yoo, Sang-ho Lee, Nojun Kwak

In this paper, we propose a method to solve the image restoration problem, which tries to restore the details of a corrupted image, especially due to the loss caused by JPEG compression. We have treated an image in the frequency domain to explicitly restore the frequency components lost during image compression. In doing so, the distribution in the frequency domain is learned using the cross entropy loss. Unlike recent approaches, we have reconstructed the details of an image without using the scheme of adversarial training. Rather, the image restoration problem is treated as a classification problem to determine the frequency coefficient for each frequency band in an image patch. In this paper, we show that the proposed method effectively restores a JPEG-compressed image with more detailed high frequency components, making the restored image more vivid.