CV | Jul 19, 2023
AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks
Kibeom Hong, Seogkyu Jeon, Junsoo Lee et al.
To deliver the artistic expression of the target style, recent studies exploit the attention mechanism owing to its ability to map the local patches of the style image to the corresponding patches of the content image. However, because of the low semantic correspondence between arbitrary content and artworks, the attention module repeatedly abuses specific local patches from the style image, resulting in disharmonious and evident repetitive artifacts. To overcome this limitation and accomplish impeccable artistic style transfer, we focus on enhancing the attention mechanism and capturing the rhythm of patterns that organize the style. In this paper, we introduce a novel metric, namely pattern repeatability, that quantifies the repetition of patterns in the style image. Based on the pattern repeatability, we propose Aesthetic Pattern-Aware style transfer Networks (AesPA-Net) that discover the sweet spot of local and global style expressions. In addition, we propose a novel self-supervisory task to encourage the attention mechanism to learn precise and meaningful semantic correspondence. Lastly, we introduce the patch-wise style loss to transfer the elaborate rhythm of local patterns. Through qualitative and quantitative evaluations, we verify the reliability of the proposed pattern repeatability that aligns with human perception, and demonstrate the superiority of the proposed framework.
CV | Oct 25, 2022
Guiding Users to Where to Give Color Hints for Efficient Interactive Sketch Colorization via Unsupervised Region Prioritization
Youngin Cho, Junsoo Lee, Soyoung Yang et al.
Existing deep interactive colorization models have focused on ways to utilize various types of interactions, such as point-wise color hints, scribbles, or natural-language texts, as methods to reflect a user's intent at runtime. However, another approach, which actively informs the user of the most effective regions to give hints for sketch image colorization, has been under-explored. This paper proposes a novel model-guided deep interactive colorization framework that reduces the required amount of user interaction by prioritizing the regions in a colorization model. Our method, called GuidingPainter, prioritizes the regions where the model most needs a color hint, rather than relying only on the user's manual decision on where to give a color hint. In our extensive experiments, we show that our approach outperforms existing interactive colorization methods in terms of conventional metrics, such as PSNR and FID, and reduces the required amount of interaction.
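For readers unfamiliar with the PSNR metric named above, the following is a minimal sketch of its standard definition, 10 * log10(MAX^2 / MSE). The function name `psnr` and the flat-list image representation are assumptions made for illustration; this is not the evaluation code used in the paper.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat lists of pixel values (an illustrative
    simplification); max_value is the largest possible pixel value.
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate that the colorized output is closer to the ground-truth image at the pixel level. FID, by contrast, compares feature statistics of image sets and is not covered by this sketch.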
CV | Sep 13, 2023
DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models
Namhyuk Ahn, Junsoo Lee, Chunggi Lee et al.
Recent progress in large-scale text-to-image models has yielded remarkable accomplishments, finding various applications in the art domain. However, expressing unique characteristics of an artwork (e.g., brushwork, color tone, or composition) with text prompts alone may encounter limitations due to the inherent constraints of verbal description. To this end, we introduce DreamStyler, a novel framework designed for artistic image synthesis, proficient in both text-to-image synthesis and style transfer. DreamStyler optimizes a multi-stage textual embedding with a context-aware text prompt, resulting in prominent image quality. In addition, with content and style guidance, DreamStyler exhibits flexibility to accommodate a range of style references. Experimental results demonstrate its superior performance across multiple scenarios, suggesting its promising potential in artistic product creation.
CV | May 25, 2022
Cross-Domain Style Mixing for Face Cartoonization
Seungkwon Kim, Chaeheon Gwak, Dohyun Kim et al.
The cartoon domain has recently gained increasing popularity. Previous studies have attempted high-quality portrait stylization into the cartoon domain; however, this poses a great challenge since they have not properly addressed the critical constraints, such as requiring a large number of training images or lacking support for abstract cartoon faces. Recently, a layer swapping method has been used for stylization requiring only a limited number of training images; however, its use cases are still narrow as it inherits the remaining issues. In this paper, we propose a novel method called Cross-domain Style mixing, which combines two latent codes from two different domains. Our method effectively stylizes faces into multiple cartoon characters at various face abstraction levels using only a single generator and without requiring a large number of training images.
CV | May 24, 2023
DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models
Sungnyun Kim, Junsoo Lee, Kibeom Hong et al.
In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose a multimodal T2I diffusion model, which is capable of processing all three modalities within a single architecture without modifying the parameters of the pre-trained diffusion model, as only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis. The code and demo are available at https://github.com/sungnyun/diffblender.
CV | Mar 28, 2024
Imperceptible Protection against Style Imitation from Diffusion Models
Namhyuk Ahn, Wonhyuk Ahn, KiYoon Yoo et al.
Recent progress in diffusion models has profoundly enhanced the fidelity of image generation, but it has raised concerns about copyright infringements. While prior methods have introduced adversarial perturbations to prevent style imitation, most are accompanied by the degradation of artworks' visual quality. Recognizing the importance of maintaining this, we introduce a visually improved protection method while preserving its protection capability. To this end, we devise a perceptual map to highlight areas sensitive to human eyes, guided by instance-aware refinement, which refines the protection intensity accordingly. We also introduce a difficulty-aware protection by predicting how difficult the artwork is to protect and dynamically adjusting the intensity based on this. Lastly, we integrate a perceptual constraints bank to further improve the imperceptibility. Results show that our method substantially elevates the quality of the protected image without compromising on protection efficacy.
CV | Dec 16, 2024
Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models
Namhyuk Ahn, KiYoon Yoo, Wonhyuk Ahn et al.
Recent advancements in diffusion models revolutionize image generation but pose risks of misuse, such as replicating artworks or generating deepfakes. Existing image protection methods, though effective, struggle to balance protection efficacy, invisibility, and latency, thus limiting practical use. We introduce perturbation pre-training to reduce latency and propose a mixture-of-perturbations approach that dynamically adapts to input images to minimize performance degradation. Our novel training strategy computes protection loss across multiple VAE feature spaces, while adaptive targeted protection at inference enhances robustness and invisibility. Experiments show comparable protection performance with improved invisibility and drastically reduced inference time. The code and demo are available at https://webtoon.github.io/impasto
CV | May 19, 2023
Chupa: Carving 3D Clothed Humans from Skinned Shape Priors using 2D Diffusion Probabilistic Models
Byungjun Kim, Patrick Kwon, Kwangho Lee et al.
We propose a 3D generation pipeline that uses diffusion models to generate realistic human digital avatars. Due to the wide variety of human identities, poses, and stochastic details, the generation of 3D human meshes has been a challenging problem. To address this, we decompose the problem into 2D normal map generation and normal map-based 3D reconstruction. Specifically, we first simultaneously generate realistic normal maps for the front and backside of a clothed human, dubbed dual normal maps, using a pose-conditional diffusion model. For 3D reconstruction, we "carve" the prior SMPL-X mesh to a detailed 3D mesh according to the normal maps through mesh optimization. To further enhance the high-frequency details, we present a diffusion resampling scheme on both body and facial regions, thus encouraging the generation of realistic digital avatars. We also seamlessly incorporate a recent text-to-image diffusion model to support text-based human identity control. Our method, namely, Chupa, is capable of generating realistic 3D clothed humans with better perceptual quality and identity variety.
CV | Feb 25, 2021
Maximizing Cosine Similarity Between Spatial Features for Unsupervised Domain Adaptation in Semantic Segmentation
Inseop Chung, Daesik Kim, Nojun Kwak
We propose a novel method that tackles the problem of unsupervised domain adaptation for semantic segmentation by maximizing the cosine similarity between the source and the target domain at the feature level. A segmentation network mainly consists of two parts, a feature extractor and a classification head. We expect that if we can make the two domains have a small domain gap at the feature level, they would also have a small domain discrepancy at the classification head. Our method computes a cosine similarity matrix between the source feature map and the target feature map, then maximizes the elements exceeding a threshold to guide the target features to have high similarity with the most similar source feature. Moreover, we use a class-wise source feature dictionary, which stores the latest features of the source domain, to prevent the unmatching problem when computing the cosine similarity matrix and to allow a target feature to be compared with various source features from various images. Through extensive experiments, we verify that our method gains performance on two unsupervised domain adaptation tasks (GTA5 $\to$ Cityscapes and SYNTHIA $\to$ Cityscapes).
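The thresholded similarity objective described above can be illustrated with a small sketch. It assumes feature maps flattened into lists of per-location feature vectors, uses a hypothetical threshold of 0.5, and turns "maximize the selected similarities" into minimizing the mean of (1 - similarity). The function names are illustrative, and the class-wise source feature dictionary is not modeled; this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def similarity_matrix(target_feats, source_feats):
    """Rows index target locations, columns index source locations."""
    return [[cosine(t, s) for s in source_feats] for t in target_feats]


def thresholded_similarity_loss(target_feats, source_feats, threshold=0.5):
    """Mean of (1 - similarity) over matrix elements that exceed the threshold.

    Minimizing this value pushes the selected similarities toward 1.
    Returns 0.0 when no element exceeds the threshold.
    """
    sims = similarity_matrix(target_feats, source_feats)
    selected = [s for row in sims for s in row if s > threshold]
    if not selected:
        return 0.0
    return sum(1.0 - s for s in selected) / len(selected)
```

In a real training setup the loss would be computed on differentiable tensors so that gradients flow into the feature extractor; plain Python lists are used here only to keep the arithmetic visible.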
MM | Aug 14, 2020
From Attack to Protection: Leveraging Watermarking Attack Network for Advanced Add-on Watermarking
Seung-Hun Nam, Jihyeon Kang, Daesik Kim et al.
Multi-bit watermarking (MW) has been designed to enhance resistance against watermarking attacks, such as signal processing operations and geometric distortions. Various benchmark tools exist to assess this robustness through simulated attacks on watermarked images. However, these tools often fail to capitalize on the unique attributes of the targeted MW and typically neglect the aspect of visual quality, a critical factor in practical applications. To overcome these shortcomings, we introduce a watermarking attack network (WAN), a fully trainable watermarking benchmark tool designed to exploit vulnerabilities within MW systems and induce watermark bit inversions, significantly diminishing watermark extractability. The proposed WAN employs an architecture based on residual dense blocks, which is adept at both local and global feature learning, thereby maintaining high visual quality while obstructing the extraction of embedded information. Our empirical results demonstrate that the WAN effectively undermines various block-based MW systems while minimizing visual degradation caused by attacks. This is facilitated by our novel watermarking attack loss, which is specifically crafted to compromise these systems. The WAN functions not only as a benchmarking tool but also as an add-on watermarking (AoW) mechanism, augmenting established universal watermarking schemes by enhancing robustness or imperceptibility without requiring detailed method context and adapting to dynamic watermarking requirements. Extensive experimental results show that AoW complements the performance of the targeted MW system by independently enhancing both imperceptibility and robustness.
CV | Nov 19, 2019
Tell Me What They're Holding: Weakly-supervised Object Detection with Transferable Knowledge from Human-object Interaction
Daesik Kim, Gyujeong Lee, Jisoo Jeong et al.
In this work, we introduce a novel weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes with few examples, using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we mainly focus on HOI as the main context which can strongly supervise complex semantics in images. Therefore, we propose a novel module called RRPN (relational region proposal network), which outputs an object-localizing attention map using only human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With the transferred knowledge about the localization map from the trained RRPN, a new object detector can learn unseen objects in the target domain with weak verbal supervision of HOI and without bounding box annotations. Because the RRPN is designed as an add-on, we can apply it not only to object detection but also to other tasks such as semantic segmentation. The experimental results on the HICO-DET dataset show that the proposed method can be a cheap alternative to the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects on the HICO-DET and V-COCO datasets.
CL | Nov 1, 2018
Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension
Daesik Kim, Seonhoon Kim, Nojun Kwak
In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task, which describes more realistic QA problems compared to other recent tasks. We mainly focus on two related issues with analysis of the TQA dataset. First, solving the TQA problems requires comprehending multi-modal contexts in complicated input data. To tackle this issue of extracting knowledge features from long text lessons and merging them with visual features, we establish a context graph from texts and images, and propose a new module, f-GCN, based on graph convolutional networks (GCN). Second, scientific terms are not spread over the chapters and subjects are split in the TQA dataset. To overcome this so-called "out-of-domain" issue, before learning QA problems, we introduce a novel self-supervised open-set learning process without any annotations. The experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both incorporating f-GCN for extracting knowledge from multi-modal contexts and our newly proposed self-supervised learning process are effective for TQA problems.
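As background for the GCN building block mentioned above, here is a minimal sketch of one standard graph convolution layer, ReLU(D^{-1/2} (A + I) D^{-1/2} H W), applied to a small context graph. It illustrates the generic GCN propagation rule only; the f-GCN module is the authors' own design and is not reproduced here. All names, and the list-of-lists matrix representation, are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency, features, weights):
    """One generic graph convolution layer with symmetric normalization.

    adjacency: n x n matrix of the context graph (without self-loops)
    features:  n x d matrix, one row per node (e.g. a text or image node)
    weights:   d x k weight matrix
    """
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}.
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(norm, features), weights)
    # ReLU non-linearity.
    return [[max(0.0, x) for x in row] for row in hidden]
```

For two connected nodes with features 1.0 and 3.0 and an identity weight, each node ends up with the average of the two values, which shows how information is mixed across neighbouring nodes of the graph.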
CV | Jul 9, 2018
Vehicle Image Generation Going Well with The Surroundings
Jeesoo Kim, Jangho Kim, Jaeyoung Yoo et al.
Since generative neural networks made a breakthrough in the image generation problem, many studies have explored their applications, such as image restoration, style transfer, and image completion. However, there has been little research on generating objects in uncontrolled real-world environments. In this paper, we propose a novel approach for vehicle image generation in real-world scenes. Using a subnetwork based on prior work on image completion, our model forms the shape of an object. Details of objects are trained by an additional colorization and refinement subnetwork, resulting in better quality of the generated objects. Unlike many other works, our method does not require any segmentation layout but still produces a plausible vehicle in the image. We evaluate our method using images from the Berkeley Deep Drive (BDD) and Cityscapes datasets, which are widely used for object detection and image segmentation problems. The adequacy of the images generated by the proposed method has also been evaluated using a widely used object detection algorithm and the FID score.
CV | Nov 27, 2017
Dynamic Graph Generation Network: Generating Relational Knowledge from Diagrams
Daesik Kim, Youngjoon Yoo, Jeesoo Kim et al.
In this work, we introduce a new algorithm for analyzing a diagram, which contains visual and textual information in an abstract and integrated way. Whereas diagrams contain richer information compared with individual image-based or language-based data, proper solutions for automatically understanding them have not been proposed due to their innate characteristics of multi-modality and arbitrariness of layouts. To tackle this problem, we propose a unified diagram-parsing network for generating knowledge from diagrams based on an object detector and a recurrent neural network designed for a graphical structure. Specifically, we propose a dynamic graph-generation network that is based on dynamic memory and graph theory. We explore the dynamics of information in a diagram with the activation of gates in gated recurrent unit (GRU) cells. On publicly available diagram datasets, our model demonstrates a state-of-the-art result that outperforms other baselines. Moreover, further experiments on question answering show the potential of the proposed method for various applications.
CV | Jul 2, 2017
Where to Play: Retrieval of Video Segments using Natural-Language Queries
Sangkuk Lee, Daesik Kim, Myunggi Lee et al.
In this paper, we propose a new approach for retrieval of video segments using natural language queries. Unlike most previous approaches, such as concept-based methods or rule-based structured models, the proposed method uses an image captioning model to construct sentential queries for visual information. In detail, our approach exploits multiple captions generated from visual features in each image with 'Densecap'. Then, the similarities between captions of adjacent images are calculated and used to track semantically similar captions over multiple frames. Besides introducing this novel idea of 'tracking by captioning', the proposed method is one of the first approaches that uses a language generation model learned by neural networks to construct a semantic query describing the relations and properties of visual information. To evaluate the effectiveness of our approach, we have created a new evaluation dataset, which contains about 348 segments of scenes in 20 movie trailers. Through quantitative and qualitative evaluation, we show that our method is effective for retrieval of video segments using natural language queries.
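The 'tracking by captioning' idea above can be sketched as follows. The sketch assumes that caption similarity is a bag-of-words cosine similarity and that captions in adjacent frames are linked greedily when their similarity reaches a hypothetical threshold of 0.5. The abstract does not specify the similarity measure or the linking rule, so both are stand-ins, and all function names are illustrative.

```python
import math
from collections import Counter


def caption_similarity(a, b):
    """Bag-of-words cosine similarity between two captions (an assumed measure)."""
    words_a, words_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(words_a[w] * words_b[w] for w in words_a)
    norm_a = math.sqrt(sum(c * c for c in words_a.values()))
    norm_b = math.sqrt(sum(c * c for c in words_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def track_captions(frames, threshold=0.5):
    """Link similar captions across adjacent frames.

    frames: one list of caption strings per frame.
    Returns a list of tracks, each a list of (frame_index, caption) pairs.
    """
    tracks = []  # every track ever started
    active = []  # tracks that were extended or started in the previous frame
    for index, captions in enumerate(frames):
        next_active = []
        used = set()  # captions of this frame already claimed by a track
        for track in active:
            last_caption = track[-1][1]
            best, best_sim = None, threshold
            for j, caption in enumerate(captions):
                if j in used:
                    continue
                sim = caption_similarity(last_caption, caption)
                if sim >= best_sim:
                    best, best_sim = j, sim
            if best is not None:
                used.add(best)
                track.append((index, captions[best]))
                next_active.append(track)
        # Unclaimed captions start new tracks.
        for j, caption in enumerate(captions):
            if j not in used:
                track = [(index, caption)]
                tracks.append(track)
                next_active.append(track)
        active = next_active
    return tracks
```

Each resulting track spans a run of consecutive frames that share a semantically similar caption, which is the kind of unit a natural-language query could then be matched against.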