Ayşegül Dündar

h-index: 18

34 papers

1,122 citations

Novelty: 50%

AI Score: 47

Ranked #32,263 of 194,257 authors (top 17%); #11,527 in CV (top 19%)

34 Papers

14.1 · CV · Mar 17, 2022

Fine Detailed Texture Learning for 3D Meshes with Generative Models

Aysegul Dundar, Jun Gao, Andrew Tao et al.

This paper presents a method to reconstruct high-quality textured 3D models from both multi-view and single-view images. The reconstruction is posed as an adaptation problem and is done progressively: in the first stage, we focus on learning accurate geometry, whereas in the second stage, we focus on learning the texture with a generative adversarial network. In the generative learning pipeline, we propose two improvements. First, since the learned textures should be spatially aligned, we propose an attention mechanism that relies on the learnable positions of pixels. Second, since the discriminator receives aligned texture maps, we augment its input with a learnable embedding, which improves the feedback to the generator. We achieve significant improvements on multi-view sequences from the Tripod dataset as well as on the single-view image datasets Pascal 3D+ and CUB. We demonstrate that our method achieves superior 3D textured models compared to previous works. Please visit our web-page for 3D visuals.

1.5 · CV · Sep 25, 2023 · Code

Diverse Semantic Image Editing with Style Codes

Hakan Sivuk, Aysegul Dundar

Semantic image editing requires inpainting pixels following a semantic map. It is a challenging task since this inpainting requires both harmony with the context and strict compliance with the semantic maps. The majority of the previous methods proposed for this task try to encode the whole information from erased images. However, when an object is added to a scene such as a car, its style cannot be encoded from the context alone. On the other hand, the models that can output diverse generations struggle to output images that have seamless boundaries between the generated and unerased parts. Additionally, previous methods do not have a mechanism to encode the styles of visible and partially visible objects differently for better performance. In this work, we propose a framework that can encode visible and partially visible objects with a novel mechanism to achieve consistency in the style encoding and final generations. We extensively compare with previous conditional image generation and semantic image editing algorithms. Our extensive experiments show that our method significantly improves over the state-of-the-art. Our method not only achieves better quantitative results but also provides diverse results. Please refer to the project web page for the released code and demo: https://github.com/hakansivuk/DivSem.

26.8 · CV · Apr 6, 2023 · Code

Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem et al.

The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove, which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object is to be removed based on natural language input and removes it simultaneously. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.

15.3 · CV · Jul 7, 2022

VecGAN: Image-to-Image Translation with Interpretable Latent Directions

Yusuf Dalva, Said Fahri Altindis, Aysegul Dundar

We propose VecGAN, an image-to-image translation framework for facial attribute editing with interpretable latent directions. The facial attribute editing task faces the challenges of precise attribute editing with controllable strength and preservation of the other attributes of an image. For this goal, we design the attribute editing by latent space factorization, and for each attribute, we learn a linear direction that is orthogonal to the others. The other component is the controllable strength of the change, a scalar value. In our framework, this scalar can be either sampled or encoded from a reference image by projection. Our work is inspired by the latent space factorization works of fixed pretrained GANs. However, while those models cannot be trained end-to-end and struggle to edit encoded images precisely, VecGAN is trained end-to-end for the image translation task and is successful at editing an attribute while preserving the others. Our extensive experiments show that VecGAN achieves significant improvements over state-of-the-art methods for both local and global edits.
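The idea of editing along orthogonal latent directions with a scalar strength can be illustrated with a small sketch. This is not the VecGAN implementation: the function names, the toy 3-D latent space, and the exact update rule (shift the latent so that its projection onto the direction matches a target strength) are assumptions made for illustration only.

```python
# Illustrative sketch only; names and the update rule are assumptions,
# not taken from the VecGAN code.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_strength(latent, direction):
    """Scalar attribute strength: projection of the latent onto the direction."""
    return dot(latent, direction) / dot(direction, direction)

def edit(latent, direction, target_strength):
    """Shift the latent along `direction` until its projection equals `target_strength`."""
    shift = target_strength - project_strength(latent, direction)
    return [z + shift * d for z, d in zip(latent, direction)]

# Two orthogonal toy attribute directions (hypothetical).
smile = [1.0, 0.0, 0.0]
hair = [0.0, 1.0, 0.0]

source = [0.2, -0.5, 0.9]
reference = [0.8, 0.1, -0.3]

# Reference-guided edit: read the strength off the reference by projection,
# then apply it to the source latent.
alpha = project_strength(reference, smile)
edited = edit(source, smile, alpha)
```

Because the two directions are orthogonal, moving along `smile` leaves the projection onto `hair` unchanged, which is the property that lets one attribute be edited while the others are preserved.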

16.8 · CV · Jul 27, 2023

Diverse Inpainting and Editing with GAN Inversion

Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen et al.

Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability. In this paper, we tackle an even more difficult task, inverting erased images into GAN's latent space for realistic inpaintings and editings. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpaintings. Specifically, we propose to learn an encoder and mixing network to combine encoded features from erased images with StyleGAN's mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks with generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts. We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Qualitative metrics and visual comparisons show significant improvements.

5.9 · CV · Mar 6, 2023 · Code

Refining 3D Human Texture Estimation from a Single Image

Said Fahri Altindis, Adil Meric, Yusuf Dalva et al.

Estimating 3D human texture from a single image is essential in graphics and vision. It requires learning a mapping function from input images of humans with diverse poses into the parametric (UV) space and reasonably hallucinating invisible parts. To achieve a high-quality 3D human texture estimation, we propose a framework that adaptively samples the input by a deformable convolution where offsets are learned via a deep neural network. Additionally, we describe a novel cycle consistency loss that improves view generalization. We further propose to train our framework with an uncertainty-based pixel-level image reconstruction loss, which enhances color fidelity. We compare our method against the state-of-the-art approaches and show significant qualitative and quantitative improvements.

8.7 · CV · Sep 30, 2024

Dual Encoder GAN Inversion for High-Fidelity 3D Head Reconstruction from Single Images

Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Aysegul Dundar

3D GAN inversion aims to project a single image into the latent space of a 3D Generative Adversarial Network (GAN), thereby achieving 3D geometry reconstruction. While there exist encoders that achieve good results in 3D GAN inversion, they are predominantly built on EG3D, which specializes in synthesizing near-frontal views and is limiting in synthesizing comprehensive 3D scenes from diverse viewpoints. In contrast to existing approaches, we propose a novel framework built on PanoHead, which excels in synthesizing images from a 360-degree perspective. To achieve realistic 3D modeling of the input image, we introduce a dual encoder system tailored for high-fidelity reconstruction and realistic generation from different viewpoints. Accompanying this, we propose a stitching framework on the triplane domain to get the best predictions from both. To achieve seamless stitching, both encoders must output consistent results despite being specialized for different tasks. For this reason, we carefully train these encoders using specialized losses, including an adversarial loss based on our novel occlusion-aware triplane discriminator. Experiments reveal that our approach surpasses the existing encoder training methods qualitatively and quantitatively. Please visit the project page: https://berkegokmen1.github.io/dual-enc-3d-gan-inv.

6.2 · CV · Jun 29, 2025 · Code

Trident: Detecting Face Forgeries with Adversarial Triplet Learning

Mustafa Hakan Kara, Aysegul Dundar, Uğur Güdükbay

As face forgeries generated by deep neural networks become increasingly sophisticated, detecting face manipulations in digital media has posed a significant challenge, underscoring the importance of maintaining digital media integrity and combating visual disinformation. Current detection models, predominantly based on supervised training with domain-specific data, often falter against forgeries generated by unencountered techniques. In response to this challenge, we introduce Trident, a face forgery detection framework that employs triplet learning with a Siamese network architecture for enhanced adaptability across diverse forgery methods. Trident is trained on curated triplets to isolate nuanced differences of forgeries, capturing fine-grained features that distinguish pristine samples from manipulated ones while controlling for other variables. To further enhance generalizability, we incorporate domain-adversarial training with a forgery discriminator. This adversarial component guides our embedding model towards forgery-agnostic representations, improving its robustness to unseen manipulations. In addition, we prevent gradient flow from the classifier head to the embedding model, avoiding overfitting induced by artifacts peculiar to certain forgeries. Comprehensive evaluations across multiple benchmarks and ablation studies demonstrate the effectiveness of our framework. We will release our code in a GitHub repository.
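Triplet learning, as mentioned in the abstract above, is commonly trained with a margin-based triplet loss. The sketch below shows that standard loss on toy embeddings; the margin value, the Euclidean distance, and the example vectors are assumptions, and the abstract does not state which exact formulation Trident uses.

```python
import math

# Standard triplet margin loss on toy embeddings (illustrative only;
# not taken from the Trident code).

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the negative is not at least `margin` farther
    from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]         # e.g. embedding of a pristine face (hypothetical)
positive = [0.0, 1.0]       # another pristine sample
easy_negative = [3.0, 4.0]  # forged sample already far from the anchor
hard_negative = [0.0, 1.5]  # forged sample close to the anchor

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
```

The easy triplet already satisfies the margin and contributes no loss, while the hard triplet yields a positive loss that pushes the forged embedding away from the anchor.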

8.8 · CV · Dec 29, 2022 · Code

StyleRes: Transforming the Residuals for Real Image Editing with StyleGAN

Hamza Pehlivan, Yusuf Dalva, Aysegul Dundar

We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between the image reconstruction fidelity and image editing quality remains an open challenge. The low-rate latent spaces are limited in their expressiveness power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degradation in editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features for adapting to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements. Code: https://github.com/hamzapehlivan/StyleRes

16.9 · CV · Mar 31, 2021 · Code

Dual Contrastive Loss and Attention for GANs

Ning Yu, Guilin Liu, Aysegul Dundar et al.

Generative Adversarial Networks (GANs) produce impressive results on unconditional image generation when powered with large-scale image datasets. Yet generated images are still easy to spot, especially on datasets with high variance (e.g. bedroom, church). In this paper, we propose various improvements to further push the boundaries in image generation. Specifically, we propose a novel dual contrastive loss and show that, with this loss, the discriminator learns more generalized and distinguishable representations to incentivize generation. In addition, we revisit attention and extensively experiment with different attention blocks in the generator. We find attention to be still an important module for successful image generation even though it was not used in the recent state-of-the-art models. Lastly, we study different attention architectures in the discriminator, and propose a reference attention mechanism. By combining the strengths of these remedies, we improve the compelling state-of-the-art Fréchet Inception Distance (FID) by at least 17.5% on several benchmark datasets. We obtain even more significant improvements on compositional synthetic scenes (up to 47.5% in FID). Code and models are available at https://github.com/ningyu1991/AttentionDualContrastGAN.

12.8 · CV · Apr 4, 2024

Reference-Based 3D-Aware Image Editing with Triplanes

Bahri Batuhan Bilecen, Yigit Yalin, Ning Yu et al.

Generative Adversarial Networks (GANs) have emerged as powerful tools for high-quality image generation and real image editing by manipulating their latent spaces. Recent advancements in GANs include 3D-aware models such as EG3D, which feature efficient triplane-based architectures capable of reconstructing 3D geometry from single images. However, limited attention has been given to providing an integrated framework for 3D-aware, high-quality, reference-based image editing. This study addresses this gap by exploring and demonstrating the effectiveness of the triplane space for advanced reference-based edits. Our novel approach integrates encoding, automatic localization, spatial disentanglement of triplane features, and fusion learning to achieve the desired edits. We demonstrate how our approach excels across diverse domains, including human faces, 360-degree heads, animal faces, partially stylized edits like cartoon faces, full-body clothing edits, and edits on class-agnostic samples. Our method shows state-of-the-art performance over relevant latent direction, text, and image-guided 2D and 3D-aware diffusion and GAN methods, both qualitatively and quantitatively.

6.8 · CV · Dec 18, 2023

Warping the Residuals for Image Editing with StyleGAN

Ahmet Burak Yildirim, Hamza Pehlivan, Aysegul Dundar

StyleGAN models show editing capabilities via their semantically interpretable latent organizations, which require successful GAN inversion methods to edit real images. Many works have been proposed for inverting images into StyleGAN's latent space. However, their results suffer from either low fidelity to the input image or poor editing quality, especially for edits that require large transformations. That is because low-rate latent spaces lose many image details due to the information bottleneck, even though they provide an editable space. On the other hand, higher-rate latent spaces can pass all the image details to StyleGAN for perfect reconstruction of images but suffer from low editing quality. In this work, we present a novel image inversion architecture that extracts high-rate latent features and includes a flow estimation module to warp these features to adapt them to edits. The flows are estimated from StyleGAN features of edited and unedited latent codes. By estimating the high-rate features and warping them for edits, we achieve both high fidelity to the input image and high-quality edits. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements.

14.4 · CV · Apr 30, 2025

3D Stylization via Large Reconstruction Model

Ipek Oztas, Duygu Ceylan, Aysegul Dundar

With the growing success of text- or image-guided 3D generators, users demand more control over the generation process, appearance stylization being one example. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency from multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe whether large reconstruction models, commonly used in the context of 3D generation, have a similar capability. We discover that certain attention blocks in these models capture appearance-specific features. By injecting features from a visual style image into such blocks, we develop a simple yet effective 3D appearance stylization method. Our method does not require training or test-time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results in terms of 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes.

2.8 · CV · Feb 15

GeoFusionLRM: Geometry-Aware Self-Correction for Consistent 3D Reconstruction

Ahmet Burak Yildirim, Tuna Saygin, Duygu Ceylan et al.

Single-image 3D reconstruction with large reconstruction models (LRMs) has advanced rapidly, yet reconstructions often exhibit geometric inconsistencies and misaligned details that limit fidelity. We introduce GeoFusionLRM, a geometry-aware self-correction framework that leverages the model's own normal and depth predictions to refine structural accuracy. Unlike prior approaches that rely solely on features extracted from the input image, GeoFusionLRM feeds back geometric cues through a dedicated transformer and fusion module, enabling the model to correct errors and enforce consistency with the conditioning image. This design improves the alignment between the reconstructed mesh and the input views without additional supervision or external signals. Extensive experiments demonstrate that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.

17.4 · CV · May 19, 2025

RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

Ahmet Berke Gokmen, Yigit Ekin, Bahri Batuhan Bilecen et al.

We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.

6.2 · CV · Apr 3, 2025

MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection

Ahmet Burak Yildirim, Mustafa Utku Aydogdu, Duygu Ceylan et al.

We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency. In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.
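One simple way to picture fusing per-view predictions in a shared UV space is to average, for every texel, the values contributed by the views that see it. The sketch below does exactly that on toy data. It is an assumption-laden illustration, not the MD-ProjTex algorithm: the dictionary representation, the plain averaging rule, and the function name `fuse_in_uv` are all hypothetical.

```python
# Toy illustration of multi-view fusion in UV space (hypothetical names and
# data layout; the paper's actual fusion rule may differ).

def fuse_in_uv(per_view_predictions):
    """Average per-texel predictions over the views in which each texel is visible.

    `per_view_predictions` is a list with one dict per view, mapping a texel
    index to the value predicted for that texel from that view.
    """
    sums, counts = {}, {}
    for view in per_view_predictions:
        for texel, value in view.items():
            sums[texel] = sums.get(texel, 0.0) + value
            counts[texel] = counts.get(texel, 0) + 1
    return {texel: sums[texel] / counts[texel] for texel in sums}

# Texel 1 is visible from both views, so its two predictions are averaged;
# texels 0 and 2 are each seen by only one view and keep that view's value.
views = [
    {0: 1.0, 1: 3.0},
    {1: 5.0, 2: 2.0},
]
fused = fuse_in_uv(views)
```

Sharing one fused value per texel is what keeps overlapping views from drifting apart, which is the intuition behind enforcing consistency in UV space rather than per image.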

5.2 · CV · Nov 20, 2024

Identity Preserving 3D Head Stylization with Multiview Score Distillation

Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Furkan Guzelant et al.

3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit https://three-bee.github.io/head_stylization for more visuals.

18.6 · CV · Jun 13, 2024 · Code

CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Yigit Ekin, Ahmet Burak Yildirim, Erdem Eren Caglar et al.

Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.

12.1 · CV · May 18, 2023

Progressive Learning of 3D Reconstruction Network from 2D GAN Data

Aysegul Dundar, Jun Gao, Andrew Tao et al.

This paper presents a method to reconstruct high-quality textured 3D models from single images. Current methods rely on datasets with expensive annotations: multi-view images and their camera parameters. Our method relies on GAN-generated multi-view image datasets, which have a negligible annotation cost. However, they are not strictly multi-view consistent, and sometimes GANs output distorted images. This results in degraded reconstruction quality. In this work, to overcome these limitations of generated datasets, we make two main contributions which lead us to achieve state-of-the-art results on challenging objects: 1) a robust multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses, and 2) a novel adversarial learning pipeline with online pseudo-ground-truth generation to achieve fine details. Our work provides a bridge from 2D supervision of GAN models to 3D reconstruction models and removes the expensive annotation efforts. We show significant improvements over previous methods whether they were trained on GAN-generated multi-view images or on real images with expensive annotations. Please visit our web-page for 3D visuals: https://research.nvidia.com/labs/adlr/progressive-3d-learning

5.6 · CV · Sep 2, 2021

Benchmarking the Robustness of Instance Segmentation Models

Yusuf Dalva, Hamza Pehlivan, Said Fahri Altindis et al.

This paper presents a comprehensive evaluation of instance segmentation models with respect to real-world image corruptions as well as out-of-domain image collections, e.g. images captured by a different set-up than the training dataset. The out-of-domain image evaluation shows the generalization capability of models, an essential aspect of real-world applications and an extensively studied topic of domain adaptation. These presented robustness and generalization evaluations are important when designing instance segmentation models for real-world applications and picking an off-the-shelf pretrained model to directly use for the task at hand. Specifically, this benchmark study includes state-of-the-art network architectures, network backbones, normalization layers, models trained starting from scratch versus pretrained networks, and the effect of multi-task training on robustness and generalization. Through this study, we gain several insights. For example, we find that group normalization enhances the robustness of networks across corruptions where the image contents stay the same but corruptions are added on top. On the other hand, batch normalization improves the generalization of the models across different datasets where statistics of image features change. We also find that single-stage detectors do not generalize well to larger image resolutions than their training size. On the other hand, multi-stage detectors can easily be used on images of different sizes. We hope that our comprehensive study will motivate the development of more robust and reliable instance segmentation models.

15.5 · CV · Jun 10, 2021

View Generalization for Single Image Textured 3D Models

Anand Bhattad, Aysegul Dundar, Guilin Liu et al.

Humans can easily infer the underlying 3D geometry and texture of an object only from a single 2D image. Current computer vision methods can do this, too, but suffer from view generalization problems - the models inferred tend to make poor predictions of appearance in novel views. As for generalization problems in machine learning, the difficulty is balancing single-view accuracy (cf. training error; bias) with novel view accuracy (cf. test error; variance). We describe a class of models whose geometric rigidity is easily controlled to manage this tradeoff. We describe a cycle consistency loss that improves view generalization (roughly, a model from a generated view should predict the original view well). View generalization of textures requires that models share texture information, so a car seen from the back still has headlights because other cars have headlights. We describe a cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing. We compare our method against the state-of-the-art method and show both qualitative and quantitative improvements.

16.0 · CV · Apr 21, 2020

Panoptic-based Image Synthesis

Aysegul Dundar, Karan Sapra, Guilin Liu et al.

Conditional image synthesis for generating photorealistic images serves various applications, from content editing to content generation. Previous conditional image synthesis algorithms mostly rely on semantic maps, and often fail in complex environments where multiple instances occlude each other. We propose a panoptic-aware image synthesis network to generate high-fidelity and photorealistic images conditioned on panoptic maps, which unify semantic and instance information. To achieve this, we efficiently use panoptic maps in convolution and upsampling layers. We show that with the proposed changes to the generator, we can improve on the previous state-of-the-art methods by generating images in complex instance interaction environments with higher fidelity and tiny objects in more detail. Furthermore, our proposed method also outperforms the previous state-of-the-art methods in the metrics of mean IoU (Intersection over Union) and detAP (Detection Average Precision).

13.6 · CV · Jan 26, 2020

Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

Aysegul Dundar, Kevin J. Shih, Animesh Garg et al.

Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foreground object of interest. However, the reconstruction task of the entire image forces the model to allocate landmarks to model the background. This work explores the effects of factorizing the reconstruction task into separate foreground and background reconstructions, conditioning only the foreground reconstruction on the unsupervised landmarks. Our experiments demonstrate that the proposed factorization results in landmarks that are focused on the foreground object of interest. Furthermore, the rendered background quality is also improved, as the background rendering pipeline no longer requires the ill-suited landmarks to model its pose and appearance. We demonstrate this improvement in the context of the video-prediction task.

4.1 · CV · Sep 6, 2019

Video Interpolation and Prediction with Unsupervised Landmarks

Kevin J. Shih, Aysegul Dundar, Animesh Garg et al.

Prediction and interpolation for long-range video data involves the complex task of modeling motion trajectories for each visible object, occlusions and dis-occlusions, as well as appearance changes due to viewpoint and lighting. Optical flow based techniques generalize but are suitable only for short temporal ranges. Many methods opt to project the video frames to a low dimensional latent space, achieving long-range predictions. However, these latent representations are often non-interpretable, and therefore difficult to manipulate. This work poses video prediction and interpolation as unsupervised latent structure inference followed by a temporal prediction in this latent space. The latent representations capture foreground semantics without explicit supervision such as keypoints or poses. Further, as each landmark can be mapped to a coordinate indicating where a semantic part is positioned, we can reliably interpolate within the coordinate domain to achieve predictable motion interpolation. Given an image decoder capable of mapping these landmarks back to the image domain, we are able to achieve high-quality long-range video interpolation and extrapolation by operating on the landmark representation space.

20.4 · CV · Jun 13, 2019

Unsupervised Video Interpolation Using Cycle Consistency

Fitsum A. Reda, Deqing Sun, Aysegul Dundar et al.

Learning to synthesize high frame rate videos via interpolation requires large quantities of high frame rate training videos, which, however, are scarce, especially at high resolutions. Here, we propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. For a triplet of consecutive frames, we optimize models to minimize the discrepancy between the center frame and its cycle reconstruction, obtained by interpolating back from interpolated intermediate frames. This simple unsupervised constraint alone achieves results comparable with supervision using the ground truth intermediate frames. We further introduce a pseudo supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model. The pseudo supervised loss term, used together with cycle consistency, can effectively adapt a pre-trained model to a new target domain. With no additional data and in a completely unsupervised fashion, our techniques significantly improve pre-trained models on new target domains, increasing PSNR values from 32.84dB to 33.05dB on the Slowflow and from 31.82dB to 32.53dB on the Sintel evaluation datasets.
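The cycle described above (interpolate between consecutive frames, then interpolate back to the center frame and compare) can be sketched in a few lines. The interpolator below is a stand-in that simply averages pixels, whereas the paper optimizes a learned model; the function names, the L1 discrepancy, and the toy one-dimensional "frames" are assumptions for illustration.

```python
# Cycle-consistency sketch for a triplet of consecutive frames.
# `interpolate` is a placeholder for a learned interpolation network.

def interpolate(frame_a, frame_b):
    """Stand-in interpolator: per-pixel average of two frames."""
    return [(a + b) / 2.0 for a, b in zip(frame_a, frame_b)]

def cycle_consistency_loss(f0, f1, f2):
    """Mean absolute difference between the center frame and its cycle reconstruction."""
    mid_01 = interpolate(f0, f1)            # intermediate frame between f0 and f1
    mid_12 = interpolate(f1, f2)            # intermediate frame between f1 and f2
    f1_cycle = interpolate(mid_01, mid_12)  # interpolate back to the center time step
    return sum(abs(a - b) for a, b in zip(f1, f1_cycle)) / len(f1)

# Pixels that change linearly over time are reconstructed exactly by averaging.
linear_loss = cycle_consistency_loss([0.0, 10.0], [1.0, 12.0], [2.0, 14.0])

# A non-linear change leaves a residual that training would try to reduce.
nonlinear_loss = cycle_consistency_loss([0.0], [4.0], [2.0])
```

No ground-truth intermediate frames appear anywhere in the loss: the only supervision is the observed center frame, which is why the constraint can be applied to low frame rate video directly.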

20.0CVJul 24, 2018

Domain Stylization: A Strong, Simple Baseline for Synthetic to Real Image Domain Adaptation

Aysegul Dundar, Ming-Yu Liu, Ting-Chun Wang et al.

Deep neural networks have largely failed to effectively utilize synthetic data when applied to real images due to the covariate shift problem. In this paper, we show that by applying a straightforward modification to an existing photorealistic style transfer algorithm, we achieve state-of-the-art synthetic-to-real domain adaptation results. We conduct extensive experimental validations on four synthetic-to-real tasks for semantic segmentation and object detection, and show that our approach exceeds the performance of any current state-of-the-art GAN-based image translation approach as measured by segmentation and object detection metrics. Furthermore, we offer a distance-based analysis of our method which shows a dramatic reduction in Fréchet Inception Distance between the source and target domains, offering a quantitative metric that demonstrates the effectiveness of our algorithm in bridging the synthetic-to-real gap.

4.4CVNov 22, 2017

Context Augmentation for Convolutional Neural Networks

Aysegul Dundar, Ignacio Garcia-Dorado

Recent enhancements of deep convolutional neural networks (ConvNets) empowered by enormous amounts of labeled data have closed the gap with human performance for many object recognition tasks. These impressive results have generated interest in understanding and visualization of ConvNets. In this work, we study the effect of background in the task of image classification. Our results show that changing the backgrounds of the training datasets can have drastic effects on testing accuracies. Furthermore, we enhance existing augmentation techniques with the foreground segmented objects. The findings of this work are important in increasing the accuracies when only a small dataset is available, in creating datasets, and creating synthetic images.

0.7LGJun 15, 2017

Human-like Clustering with Deep Convolutional Neural Networks

Ali Borji, Aysegul Dundar

Classification and clustering have been studied separately in machine learning and computer vision. Inspired by the recent success of deep learning models in solving various vision problems (e.g., object recognition, semantic segmentation) and the fact that humans serve as the gold standard in assessing clustering algorithms, here, we advocate for a unified treatment of the two problems and suggest that hierarchical frameworks that progressively build complex patterns on top of the simpler ones (e.g., convolutional neural networks) offer a promising solution. We do not dwell much on the learning mechanisms in these frameworks as they are still a matter of debate with respect to biological constraints. Instead, we emphasize the compositionality of real-world structures and objects. In particular, we show that CNNs, trained end to end using back propagation with noisy labels, are able to cluster data points belonging to several overlapping shapes, and do so much better than the state-of-the-art algorithms. The main takeaway lesson from our study is that mechanisms of human vision, particularly the hierarchical organization of the visual ventral stream, should be taken into account in clustering algorithms (e.g., for learning representations in an unsupervised manner or with minimum supervision) to reach human-level clustering performance. This by no means suggests that other methods do not hold merit. For example, methods relying on pairwise affinities (e.g., spectral clustering) have been very successful in many scenarios but still fail in some cases (e.g., overlapping clusters).

20.3LGNov 19, 2015

Robust Convolutional Neural Networks under Adversarial Noise

Jonghoon Jin, Aysegul Dundar, Eugenio Culurciello

Recent studies have shown that Convolutional Neural Networks (CNNs) are vulnerable to small input perturbations known as "adversarial examples". In this work, we propose a new feedforward CNN that improves robustness in the presence of adversarial noise. Our model uses stochastic additive noise added to the input image and to the CNN models. The proposed model operates in conjunction with a CNN trained with either a standard or an adversarial objective function. In particular, convolution, max-pooling, and ReLU layers are modified to benefit from the noise model. Our feedforward model is parameterized by only a mean and variance per pixel, which simplifies computations and makes our method scalable to a deep architecture. On CIFAR-10 and ImageNet tests, the proposed model outperforms other methods, and the improvement is more evident for difficult classification tasks or stronger adversarial noise.

15.8LGNov 19, 2015

Convolutional Clustering for Unsupervised Learning

Aysegul Dundar, Jonghoon Jin, Eugenio Culurciello

The task of labeling data for training deep neural networks is daunting and tedious, requiring millions of labels to achieve the current state-of-the-art results. Such reliance on large amounts of labeled data can be relaxed by exploiting hierarchical features via unsupervised learning techniques. In this work, we propose to train a deep convolutional network based on an enhanced version of the k-means clustering algorithm, which reduces the number of correlated parameters in the form of similar filters, and thus increases test categorization accuracy. We call our algorithm convolutional k-means clustering. We further show that learning the connection between the layers of a deep convolutional neural network improves its ability to be trained on a smaller amount of labeled data. Our experiments show that the proposed algorithm outperforms other techniques that learn filters unsupervised. Specifically, we obtained a test accuracy of 74.1% on STL-10 and a test error of 0.5% on MNIST.

34.7NEDec 17, 2014Code

Flattened Convolutional Neural Networks for Feedforward Acceleration

Jonghoon Jin, Aysegul Dundar, Eugenio Culurciello

We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially the weights of the convolutional filters in convolutional neural networks, has been extensively studied, and different heuristics have been proposed to construct a low rank basis of the filters after training. In this work, we train flattened networks that consist of a consecutive sequence of one-dimensional filters across all directions in 3D space to obtain performance comparable to conventional convolutional networks. We tested the flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide around a two times speed-up during the feedforward pass compared to the baseline model due to the significant reduction of learning parameters. Furthermore, the proposed method does not require manual tuning or post processing once the model is trained.
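
The parameter saving behind the flattened layer can be illustrated with a small sketch. It assumes a single filter over a (C, Y, X) input patch and shows that three consecutive 1D filters reproduce the response of a rank-1 3D filter while storing C + Y + X weights instead of C * Y * X. The function names are hypothetical, and details of the paper's training procedure are not reproduced.

```python
import numpy as np

def full_3d_response(patch, kernel):
    # Response of one dense 3D filter at one location: C * Y * X weights.
    return float(np.sum(patch * kernel))

def flattened_response(patch, w_c, w_y, w_x):
    # Apply three 1D filters in sequence: across channels, then
    # vertically, then horizontally. Uses C + Y + X weights in total.
    out = np.tensordot(w_c, patch, axes=(0, 0))  # shape (Y, X)
    out = np.tensordot(w_y, out, axes=(0, 0))    # shape (X,)
    return float(np.dot(w_x, out))

rng = np.random.default_rng(0)
C, Y, X = 3, 5, 5
patch = rng.standard_normal((C, Y, X))
w_c, w_y, w_x = (rng.standard_normal(n) for n in (C, Y, X))

# The equivalent dense kernel is the outer product of the 1D filters.
rank1_kernel = np.einsum("c,y,x->cyx", w_c, w_y, w_x)
dense_params = C * Y * X      # 75 weights
flat_params = C + Y + X       # 13 weights
```

The equivalence holds only for rank-1 kernels; the abstract's claim is that networks trained directly in this flattened form reach comparable accuracy, which this sketch does not test.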

15.2CVJun 1, 2013

An Analysis of the Connections Between Layers of Deep Neural Networks

Eugenio Culurciello, Jonghoon Jin, Aysegul Dundar et al.

We present an analysis of different techniques for selecting the connection between layers of deep neural networks. Traditional deep neural networks use random connection tables between layers to keep the number of connections small and tune to different image features. This kind of connection performs adequately in supervised deep networks because their values are refined during the training. On the other hand, in unsupervised learning, one cannot rely on back-propagation techniques to learn the connections between layers. In this work, we tested four different techniques for connecting the first layer of the network to the second layer on the CIFAR and SVHN datasets and showed that the accuracy can be improved up to 3% depending on the technique used. We also showed that learning the connections based on the co-occurrences of the features does not confer an advantage over a random connection table in small networks. This work is helpful to improve the efficiency of connections between the layers of unsupervised deep neural networks.

5.5CVJan 13, 2013

Clustering Learning for Robotic Vision

Eugenio Culurciello, Jordan Bates, Aysegul Dundar et al.

We present the clustering learning technique applied to multi-layer feedforward deep neural networks. We show that this unsupervised learning technique can compute network filters in only a few minutes and with a much reduced set of parameters. The goal of this paper is to promote the technique for general-purpose robotic vision systems. We report its use in static image datasets and object tracking datasets. We show that networks trained with clustering learning can outperform large networks trained for many hours on complex datasets.

2.8CVSep 12, 2012

Visual Tracking with Similarity Matching Ratio

Aysegul Dundar, Jonghoon Jin, Eugenio Culurciello

This paper presents a novel approach to visual tracking: Similarity Matching Ratio (SMR). The traditional approach to tracking minimizes some measure of the difference between the template and a patch from the frame. This approach is vulnerable to outliers and drastic appearance changes, and extensive work has focused on making it more tolerant to them. However, this often results in longer, corrective algorithms which do not solve the original problem. This paper proposes a novel definition of the tracking problem, SMR, which turns the differences into a probability measure. Only pixel differences below a threshold count towards deciding the match; the rest are ignored. This approach makes the SMR tracker robust to outliers and points that dramatically change appearance. The SMR tracker is tested on challenging video sequences and achieves state-of-the-art performance.
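
The thresholded matching idea can be sketched as below. This is a minimal illustration, assuming the score is the fraction of pixels whose absolute difference falls below the threshold and that the search is an exhaustive scan over a grayscale frame; the function names and the exact normalization are assumptions, not taken from the paper.

```python
import numpy as np

def similarity_matching_ratio(template, patch, threshold):
    # Fraction of pixels whose absolute difference is below the threshold.
    # Pixels above the threshold are ignored rather than penalized.
    diff = np.abs(np.asarray(template, dtype=float) - np.asarray(patch, dtype=float))
    return float(np.mean(diff < threshold))

def best_match(template, frame, threshold):
    # Exhaustive search for the patch with the highest matching ratio.
    th, tw = template.shape
    best_score, best_pos = -1.0, None
    for y in range(frame.shape[0] - th + 1):
        for x in range(frame.shape[1] - tw + 1):
            score = similarity_matching_ratio(template, frame[y:y + th, x:x + tw], threshold)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score

# Example: the template appears at row 2, column 3 with one corrupted pixel.
template = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
frame = np.zeros((6, 6))
frame[2:5, 3:6] = template
frame[2, 3] = 255  # outlier pixel
position, score = best_match(template, frame, threshold=5)
```

Because the outlier only removes one pixel from the count instead of contributing a large error, the correct location still scores highest, which is the robustness property the abstract describes.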