Nanxuan Zhao

CV
h-index41
44papers
1,898citations
Novelty57%
AI Score60

44 Papers

CVMay 4, 2022Code
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition

Haodong Duan, Nanxuan Zhao, Kai Chen et al. · pku

Recognizing transformation types applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, which achieves much inferior performance compared to instance discrimination approaches (InstDisc) in recent works. However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporal-related downstream tasks. Based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we developed TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, consistently outperforming the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top1 Acc); improves video retrieval on UCF101 by 20.4% (R@1). The promising results validate that RecogTrans is still a worth exploring paradigm for video self-supervised learning. Codes will be released at https://github.com/kennymckormick/TransRank.

CVMar 24, 2023
Grid-guided Neural Radiance Fields for Large Urban Scenes

Linning Xu, Yuanbo Xiangli, Sida Peng et al.

Purely MLP-based neural radiance fields (NeRF-based methods) often suffer from underfitting with blurred renderings on large-scale scenes due to limited model capacity. Recent approaches propose to geographically divide the scene and adopt multiple sub-NeRFs to model each region individually, leading to linear scale-up in training costs and the number of sub-NeRFs as the scene expands. An alternative solution is to use a feature grid representation, which is computationally efficient and can naturally scale to a large scene with increased grid resolutions. However, the feature grid tends to be less constrained and often reaches suboptimal solutions, producing noisy artifacts in renderings, especially in regions with complex geometry and texture. In this work, we present a new framework that realizes high-fidelity rendering on large urban scenes while being computationally efficient. We propose to use a compact multiresolution ground feature plane representation to coarsely capture the scene, and complement it with positional encoding inputs through another NeRF branch for rendering in a joint learning fashion. We show that such an integration can utilize the advantages of two alternative solutions: a light-weighted NeRF is sufficient, under the guidance of the feature grid representation, to render photorealistic novel views with fine details; and the jointly optimized ground feature planes, can meanwhile gain further refinements, forming a more accurate and compact feature space and output much more natural rendering results.

CVMar 23, 2023
Neural Preset for Color Style Transfer

Zhanghan Ke, Yuhao Liu, Lei Zhu et al.

In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, vast memory requirement, and slow style switching speed. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to consistently operate on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline by dividing the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Due to the unavailability of pairwise datasets, we describe how to train Neural Preset via a self-supervised strategy. Various advantages of Neural Preset over existing methods are demonstrated through comprehensive evaluations. Notably, Neural Preset enables stable 4K color style transfer in real-time without artifacts. Besides, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization. Project page with demos: https://zhkkke.github.io/NeuralPreset .

CVAug 6, 2023
Language-based Photo Color Adjustment for Graphic Designs

Zhenwei Wang, Nanxuan Zhao, Gerhard Hancke et al.

Adjusting the photo color to associate with some design elements is an essential way for a graphic design to effectively deliver its message and make it aesthetically pleasing. However, existing tools and previous works face a dilemma between the ease of use and level of expressiveness. To this end, we introduce an interactive language-based approach for photo recoloring, which provides an intuitive system that can assist both experts and novices on graphic design. Given a graphic design containing a photo that needs to be recolored, our model can predict the source colors and the target regions, and then recolor the target regions with the source colors based on the given language-based instruction. The multi-granularity of the instruction allows diverse user intentions. The proposed novel task faces several unique challenges, including: 1) color accuracy for recoloring with exactly the same color from the target design element as specified by the user; 2) multi-granularity instructions for parsing instructions correctly to generate a specific result or multiple plausible ones; and 3) locality for recoloring in semantically meaningful local regions to preserve original image semantics. To address these challenges, we propose a model called LangRecol with two main components: the language-based source color prediction module and the semantic-palette-based photo recoloring module. We also introduce an approach for generating a synthetic graphic design dataset with instructions to enable model training. We evaluate our model via extensive experiments and user studies. We also discuss several practical applications, showing the effectiveness and practicality of our approach. Code and data for this paper are at: https://zhenwwang.github.io/langrecol.

CVMar 24, 2023
AssetField: Assets Mining and Reconfiguration in Ground Feature Plane Representation

Yuanbo Xiangli, Linning Xu, Xingang Pan et al.

Both indoor and outdoor environments are inherently structured and repetitive. Traditional modeling pipelines keep an asset library storing unique object templates, which is both versatile and memory efficient in practice. Inspired by this observation, we propose AssetField, a novel neural scene representation that learns a set of object-aware ground feature planes to represent the scene, where an asset library storing template feature patches can be constructed in an unsupervised manner. Unlike existing methods which require object masks to query spatial points for object editing, our ground feature plane representation offers a natural visualization of the scene in the bird-eye view, allowing a variety of operations (e.g. translation, duplication, deformation) on objects to configure a new scene. With the template feature patches, group editing is enabled for scenes with many recurring items to avoid repetitive work on object individuals. We show that AssetField not only achieves competitive performance for novel-view synthesis but also generates realistic renderings for new scene configurations.

CVOct 20, 2023
Localizing and Editing Knowledge in Text-to-Image Generative Models

Samyadeep Basu, Nanxuan Zhao, Vlad Morariu et al.

Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike generative large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method Diff-QuickFix which can effectively edit concepts in text-to-image models. DiffQuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods.

CVAug 23, 2023Code
NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos

Ziyu Yang, Sucheng Ren, Zongwei Wu et al.

Non-photorealistic videos are in demand with the wave of the metaverse, but lack of sufficient research studies. This work aims to take a step forward to understand how humans perceive non-photorealistic videos with eye fixation (\ie, saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill in the gap of missing a suitable dataset for this research line, we present NPF-200, the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. Our dataset has three characteristics: 1) it contains soundtracks that are essential according to vision and psychological studies; 2) it includes diverse semantic content and videos are of high-quality; 3) it has rich motions across and within videos. We conduct a series of analyses to gain deeper insights into this task and compare several state-of-the-art methods to explore the gap between natural images and non-photorealistic data. Additionally, as the human attention system tends to extract visual and audio features with different frequencies, we propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet, demonstrating the state-of-the-art performance of our task. The results uncover strengths and weaknesses of multi-modal network design and multi-domain training, opening up promising directions for future works. {Our dataset and code can be found at \url{https://github.com/Yangziyu/NPF200}}.

CVSep 22, 2022
UniColor: A Unified Framework for Multi-Modal Colorization with Transformer

Zhitong Huang, Nanxuan Zhao, Jing Liao

We propose the first unified framework UniColor to support colorization in multiple modalities, including both unconditional and conditional ones, such as stroke, exemplar, text, and even a mix of them. Rather than learning a separate model for each type of condition, we introduce a two-stage colorization framework for incorporating various conditions into a single model. In the first stage, multi-modal conditions are converted into a common representation of hint points. Particularly, we propose a novel CLIP-based method to convert the text to hint points. In the second stage, we propose a Transformer-based network composed of Chroma-VQGAN and Hybrid-Transformer to generate diverse and high-quality colorization results conditioned on hint points. Both qualitative and quantitative comparisons demonstrate that our method outperforms state-of-the-art methods in every control modality and further enables multi-modal colorization that was not feasible before. Moreover, we design an interactive interface showing the effectiveness of our unified framework in practical usage, including automatic colorization, hybrid-control colorization, local recolorization, and iterative color editing. Our code and models are available at https://luckyhzt.github.io/unicolor.

CVSep 21, 2023
Text-Guided Vector Graphics Customization

Peiying Zhang, Nanxuan Zhao, Jing Liao

Vector graphics are widely used in digital art and valued by designers for their scalability and layer-wise topological properties. However, the creation and editing of vector graphics necessitate creativity and design expertise, leading to a time-consuming process. In this paper, we propose a novel pipeline that generates high-quality customized vector graphics based on textual prompts while preserving the properties and layer-wise information of a given exemplar SVG. Our method harnesses the capabilities of large pre-trained text-to-image models. By fine-tuning the cross-attention layers of the model, we generate customized raster images guided by textual prompts. To initialize the SVG, we introduce a semantic-based path alignment method that preserves and transforms crucial paths from the exemplar SVG. Additionally, we optimize path parameters using both image-level and vector-level losses, ensuring smooth shape deformation while aligning with the customized raster image. We extensively evaluate our method using multiple metrics from vector-level, image-level, and text-level perspectives. The evaluation results demonstrate the effectiveness of our pipeline in generating diverse customizations of vector graphics with exceptional quality. The project page is https://intchous.github.io/SVGCustomization.

CVSep 22, 2024
GroupDiff: Diffusion-based Group Portrait Editing

Yuming Jiang, Nanxuan Zhao, Qing Liu et al.

Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and the diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labeled data for group photo editing, we create a data engine to generate paired data for training. The training data engine covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep the appearance consistent after editing, we inject the images of persons from the group photo into the attention modules and employ skeletons to provide intra-person guidance. 3) Control Flexibility: Bounding boxes indicating the locations of each person are used to reweight the attention matrix so that the features of each person can be injected into the correct places. This inter-person guidance provides flexible manners for manipulation. Extensive experiments demonstrate that GroupDiff exhibits state-of-the-art performance compared to existing methods. GroupDiff offers controllability for editing and maintains the fidelity of the original photos.

65.8CVMar 25
SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons

Haiyang Xu, Ronghuan Wu, Li-Yi Wei et al.

Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: https://xxuhaiyang.github.io/SemLayer/

AIMar 24, 2025Code
From Fragment to One Piece: A Survey on AI-Driven Graphic Design

Xingxing Zou, Wen Zhang, Nanxuan Zhao

This survey provides a comprehensive overview of the advancements in Artificial Intelligence in Graphic Design (AIGD), focusing on integrating AI techniques to support design interpretation and enhance the creative process. We categorize the field into two primary directions: perception tasks, which involve understanding and analyzing design elements, and generation tasks, which focus on creating new design elements and layouts. The survey covers various subtasks, including visual element perception and generation, aesthetic and semantic understanding, layout analysis, and generation. We highlight the role of large language models and multimodal approaches in bridging the gap between localized visual features and global design intent. Despite significant progress, challenges remain to understanding human intent, ensuring interpretability, and maintaining control over multilayered compositions. This survey serves as a guide for researchers, providing information on the current state of AIGD and potential future directions\footnote{https://github.com/zhangtianer521/excellent\_Intelligent\_graphic\_design}.

89.9CVMar 16
Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen et al.

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.

CVApr 30, 2024
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Kai Zhang, Sai Bi, Hao Tan et al.

We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: https://sai-bi.github.io/project/gs-lrm/ .

57.3CVApr 22
Linear Image Generation by Synthesizing Exposure Brackets

Yuekun Dai, Zhoutong Zhang, Shangchen Zhou et al.

The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.

CVSep 24, 2025Code
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Xuan Ju, Tianyu Wang, Yuqian Zhou et al.

Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.

CVAug 16, 2021Code
Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation

Ailing Zeng, Xiao Sun, Lei Yang et al.

Various deep learning techniques have been proposed to solve the single-view 2D-to-3D pose estimation problem. While the average prediction accuracy has been improved significantly over the years, the performance on hard poses with depth ambiguity, self-occlusion, and complex or rare poses is still far from satisfactory. In this work, we target these hard poses and present a novel skeletal GNN learning solution. To be specific, we propose a hop-aware hierarchical channel-squeezing fusion layer to effectively extract relevant information from neighboring nodes while suppressing undesired noises in GNN learning. In addition, we propose a temporal-aware dynamic graph construction procedure that is robust and effective for 3D pose estimation. Experimental results on the Human3.6M dataset show that our solution achieves 10.3\% average prediction accuracy improvement and greatly improves on hard poses over state-of-the-art techniques. We further apply the proposed technique on the skeleton-based action recognition task and also achieve state-of-the-art performance. Our code is available at https://github.com/ailingzengzzz/Skeletal-GNN.

CVAug 5, 2021Code
Unifying Global-Local Representations in Salient Object Detection with Transformer

Sucheng Ren, Qiang Wen, Nanxuan Zhao et al.

The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model deep enough to have a global receptive field and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, vision transformer, into salient object detection to ensure the globalization of the representations from shallow to deep layers. With the global view in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in final saliency maps. Besides, as each layer can capture a global view of its previous layer, adjacent layers can implicitly maximize the representation differences and minimize the redundant features, making that every output feature of transformer layers contributes uniquely for final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks by a large margin, with an average of 12.17% improvement in terms of Mean Absolute Error (MAE). Code will be available at https://github.com/OliverRensu/GLSTR.

CVDec 11, 2025
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Peiying Zhang, Nanxuan Zhao, Matthew Fisher et al.

Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

CVDec 10, 2024
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Xi Chen, Zhifei Zhang, He Zhang et al.

We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by tasks, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.

CVMar 1, 2024
Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

Yuhao Liu, Zhanghan Ke, Fang Liu et al.

Diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. However, due to the randomness in the diffusion process, they often struggle with handling diverse low-level tasks that require details preservation. To overcome this limitation, we present a new Diff-Plugin framework to enable a single pre-trained diffusion model to generate high-fidelity results across a variety of low-level tasks. Specifically, we first propose a lightweight Task-Plugin module with a dual branch design to provide task-specific priors, guiding the diffusion process in preserving image content. We then propose a Plugin-Selector that can automatically select different Task-Plugins based on the text instruction, allowing users to edit images by indicating multiple low-level tasks with natural language. We conduct extensive experiments on 8 low-level vision tasks. The results demonstrate the superiority of Diff-Plugin over existing methods, particularly in real-world scenarios. Our ablations further validate that Diff-Plugin is stable, schedulable, and supports robust training across different dataset sizes.

CVMay 16, 2024
Text-to-Vector Generation with Neural Path Representation

Peiying Zhang, Nanxuan Zhao, Jing Liao

Vector graphics are widely used in digital art and highly favored by designers due to their scalability and layer-wise properties. However, the process of creating and editing vector graphics requires creativity and design expertise, making it a time-consuming task. Recent advancements in text-to-vector (T2V) generation have aimed to make this process more accessible. However, existing T2V methods directly optimize control points of vector graphics paths, often resulting in intersecting or jagged paths due to the lack of geometry constraints. To overcome these limitations, we propose a novel neural path representation by designing a dual-branch Variational Autoencoder (VAE) that learns the path latent space from both sequence and image modalities. By optimizing the combination of neural paths, we can incorporate geometric constraints while preserving expressivity in generated SVGs. Furthermore, we introduce a two-stage path optimization method to improve the visual and topological quality of generated SVGs. In the first stage, a pre-trained text-to-image diffusion model guides the initial generation of complex vector graphics through the Variational Score Distillation (VSD) process. In the second stage, we refine the graphics using a layer-wise image vectorization strategy to achieve clearer elements and structure. We demonstrate the effectiveness of our method through extensive experiments and showcase various applications. The project page is https://intchous.github.io/T2V-NPR.

CVDec 17, 2024
MotionBridge: Dynamic Video Inbetweening with Flexible Controls

Maham Tanveer, Yang Zhou, Simon Niklaus et al.

By generating plausible and smooth transitions between two image frames, video inbetweening is an essential tool for video editing and long video synthesis. Traditional works lack the capability to generate complex large motions. While recent video generation techniques are powerful in creating high-quality results, they often lack fine control over the details of intermediate frames, which can lead to results that do not align with the creative mind. We introduce MotionBridge, a unified video inbetweening framework that allows flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text. However, learning such multi-modal controls in a unified framework is a challenging task. We thus design two generators to extract the control signal faithfully and encode feature through dual-branch embedders to resolve ambiguities. We further introduce a curriculum training strategy to smoothly learn various controls. Extensive qualitative and quantitative experiments have demonstrated that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

CVApr 8, 2024
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Jing Gu, Nanxuan Zhao, Wei Xiong et al.

Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weaving captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping to apply region control over latent feature maps and swap masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation, to seamlessly adapt the semantic concept into the original image in terms of target location, shape, style, and content during the image generation process. Extensive results on both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows its precise and faithful swapping abilities across single object, multiple objects, partial object, and cross-domain swapping tasks. SwapAnything also achieves great performance on text-based swapping and tasks beyond swapping such as object insertion.

CVDec 13, 2024
SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu et al.

We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without the need for extra cost at test-time. To enable zero-shot capability, we introduce a scalable pipeline to construct synthetic dataset which is specifically designed for subject-driven customization, leading to 2.5 millions of image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments are conducted. Compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of our proposed method.

CVDec 5, 2024
LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors

Yusuf Dalva, Yijun Li, Qing Liu et al.

Large-scale diffusion models have achieved remarkable success in generating high-quality images from textual descriptions, gaining popularity across various applications. However, the generation of layered content, such as transparent images with foreground and background layers, remains an under-explored area. Layered content generation is crucial for creative workflows in fields like graphic design, animation, and digital art, where layer-based approaches are fundamental for flexible editing and composition. In this paper, we propose a novel image generation pipeline based on Latent Diffusion Models (LDMs) that generates images with two layers: a foreground layer (RGBA) with transparency information and a background layer (RGB). Unlike existing methods that generate these layers sequentially, our approach introduces a harmonized generation mechanism that enables dynamic interactions between the layers for more coherent outputs. We demonstrate the effectiveness of our method through extensive qualitative and quantitative experiments, showing significant improvements in visual coherence, image quality, and layer consistency compared to baseline methods.

CVFeb 5, 2024
AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising

Maham Tanveer, Yizhi Wang, Ruiqi Wang et al.

We present AnaMoDiff, a novel diffusion-based method for 2D motion analogies that is applied to raw, unannotated videos of articulated characters. Our goal is to accurately transfer motions from a 2D driving video onto a source character, with its identity, in terms of appearance and natural movement, well preserved, even when there may be significant discrepancies between the source and driving characters in their part proportions and movement speed and styles. Our diffusion model transfers the input motion via a latent optical flow (LOF) network operating in a noised latent space, which is spatially aware, efficient to process compared to the original RGB videos, and artifact-resistant through the diffusion denoising process even amid dense movements. To accomplish both motion analogy and identity preservation, we train our denoising model in a feature-disentangled manner, operating at two noise levels. While identity-revealing features of the source are learned via conventional noise injection, motion features are learned from LOF-warped videos by only injecting noise with large values, with the stipulation that motion properties involving pose and limbs are encoded by higher-level features. Experiments demonstrate that our method achieves the best trade-off between motion analogy and identity preservation.

CVJan 10, 2025
EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

Shengqi Dang, Yi He, Long Ling et al.

Recent research shows that emotions can enhance users' cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this work, we introduce the new task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values. Specifically, we propose a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling the capture of specific emotions in alignment with intended input prompts. Additionally, we introduce a loss function to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.

CVMay 15, 2025
3D-Fixup: Advancing Photo Editing with 3D Priors

Yen-Chi Cheng, Krishna Kumar Singh, Jae Shin Yoon et al.

Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/

CVJul 8, 2025
Rethinking Layered Graphic Design Generation with a Top-Down Approach

Jingye Chen, Zhaowen Wang, Nanxuan Zhao et al.

Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages. For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such as SAM and element removal models to facilitate the creation of graphic layers. We train our method using the in-house graphic design dataset Design39K, augmented with AI-generated design images coupled with refined ground truth created by a customized inpainting model. Experimental results and user studies by designers show that Accordion generates favorable results on the DesignIntention benchmark, including tasks such as text-to-template, adding text to background, and text de-rendering, and also excels in creating design variations.

CVApr 3, 2025
Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization

Junying Wang, Jingyuan Liu, Xin Sun et al.

This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of dataset, restricting existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in the coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns the lighting cycle consistency from many real-world videos without any ground truth. In inference time, our temporal lighting module is combined with the diffusion models through the spatio-temporal feature blending algorithms without extra training; and we apply a new guided refinement as a post-processing to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows a strong generalizability and lighting temporal coherence, outperforming existing image-based human relighting and harmonization methods.

GRMay 15, 2025
Style Customization of Text-to-Vector Generation with Image Diffusion Priors

Peiying Zhang, Nanxuan Zhao, Jing Liao

Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics. Extending existing T2V methods for style customization poses certain challenges. Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity. On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data. To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors. In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities. In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models. By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner. The effectiveness of our method has been validated through extensive experiments. The project page is https://customsvg.github.io.

CVNov 21, 2025
Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri et al.

We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth infers a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.

CVOct 16, 2025
Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao et al.

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

CVOct 9, 2025
MultiCOIN: Multi-Modal COntrollable Video INbetweening

Maham Tanveer, Yang Zhou, Simon Niklaus et al.

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

CVJul 7, 2025
Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao et al.

Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.

GRMar 20, 2025
Bézier Splatting for Fast and Differentiable Vector Graphics Rendering

Xi Liu, Chaoyi Zhou, Nanxuan Zhao et al.

Differentiable vector graphics (VGs) are widely used in image vectorization and vector synthesis, while existing representations are costly to optimize and struggle to achieve high-quality rendering results for high-resolution images. This work introduces a new differentiable VG representation, dubbed Bézier Splatting, that enables fast yet high-fidelity VG rasterization. Bézier Splatting samples 2D Gaussians along Bézier curves, which naturally provide positional gradients at object boundaries. Thanks to the efficient splatting-based differentiable rasterizer, Bézier Splatting achieves 30x and 150x faster per forward and backward rasterization step for open curves compared to DiffVG. Additionally, we introduce an adaptive pruning and densification strategy that dynamically adjusts the spatial distribution of curves to escape local minima, further improving VG quality. Furthermore, our new VG representation supports conversion to standard XML-based SVG format, enhancing interoperability with existing VG tools and pipelines. Experimental results show that Bézier Splatting significantly outperforms existing methods with better visual fidelity and significant optimization speedup.

LGFeb 21, 2025
Mean-Shift Distillation for Diffusion Mode Seeking

Vikas Thamizharasan, Nikitas Chatzis, Iliyan Georgiev et al.

We present mean-shift distillation, a novel diffusion distillation technique that provides a provably good proxy for the gradient of the diffusion output distribution. This is derived directly from mean-shift mode seeking on the distribution, and we show that its extrema are aligned with the modes. We further derive an efficient product distribution sampling procedure to evaluate the gradient. Our method is formulated as a drop-in replacement for score distillation sampling (SDS), requiring neither model retraining nor extensive modification of the sampling procedure. We show that it exhibits superior mode alignment as well as improved convergence in both synthetic and practical setups, yielding higher-fidelity results when applied to both text-to-image and text-to-3D applications with Stable Diffusion.

CVJun 13, 2024
Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

Yufan Zhou, Ruiyi Zhang, Kaizhi Zheng et al.

In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for specific subject from arbitrary testing image in a zero-shot manner. They even outperform methods which require additional fine-tuning on testing images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images for the same subject based on creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient method to construct datasets for subject-driven editing and generation. Specifically, our dataset construction does not need any subject-level fine-tuning. After pre-training two generative models, we are able to generate infinite number of high-quality samples. We construct the first large-scale dataset for subject-driven image editing and generation, which contains 5 million image pairs, text prompts, and masks. Our dataset is 5 times the size of previous largest dataset, yet our cost is tens of thousands of GPU hours lower. To test the proposed dataset, we also propose a model which is capable of both subject-driven image editing and generation. By simply training the model on our proposed dataset, it obtains competitive results, illustrating the effectiveness of the proposed dataset construction framework.

CVMay 29, 2023
Photoswap: Personalized Subject Swapping in Images

Jing Gu, Yilin Wang, Nanxuan Zhao et al.

In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images. Photoswap first learns the visual concept of the subject from reference images and then swaps it into the target image using pre-trained diffusion models in a training-free manner. We establish that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image. Comprehensive experiments underscore the efficacy and controllability of Photoswap in personalized subject swapping. Furthermore, Photoswap significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality, revealing its vast application potential, from entertainment to professional editing.

CVMay 8, 2023
FashionTex: Controllable Virtual Try-on with Text and Texture

Anran Lin, Nanxuan Zhao, Shuliang Ning et al.

Virtual try-on attracts increasing research attention as a promising way for enhancing the user experience for online cloth shopping. Though existing methods can generate impressive results, users need to provide a well-designed reference image containing the target fashion clothes that often do not exist. To support user-friendly fashion customization in full-body portraits, we propose a multi-modal interactive setting by combining the advantages of both text and texture for multi-level fashion manipulation. With the carefully designed fashion editing module and loss functions, FashionTex framework can semantically control cloth types and local texture patterns without annotated pairwise training data. We further introduce an ID recovery module to maintain the identity of input portrait. Extensive experiments have demonstrated the effectiveness of our proposed pipeline.

CVDec 10, 2021
BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering

Yuanbo Xiangli, Linning Xu, Xingang Pan et al.

Neural radiance fields (NeRF) has achieved outstanding performance in modeling 3D objects and controlled scenes, usually under a single scale. In this work, we focus on multi-scale cases where large changes in imagery are observed at drastically different scales. This scenario vastly exists in real-world 3D environments, such as city scenes, with views ranging from satellite level that captures the overview of a city, to ground level imagery showing complex details of an architecture; and can also be commonly identified in landscape and delicate minecraft 3D models. The wide span of viewing positions within these scenes yields multi-scale renderings with very different levels of detail, which poses great challenges to neural radiance field and biases it towards compromised results. To address these issues, we introduce BungeeNeRF, a progressive neural radiance field that achieves level-of-detail rendering across drastically varied scales. Starting from fitting distant views with a shallow base block, as training progresses, new blocks are appended to accommodate the emerging details in the increasingly closer views. The strategy progressively activates high-frequency channels in NeRF's positional encoding inputs and successively unfolds more complex details as the training proceeds. We demonstrate the superiority of BungeeNeRF in modeling diverse multi-scale scenes with drastically varying views on multiple data sources (city models, synthetic, and drone captured data) and its support for high-quality rendering in different levels of detail.

CVJun 11, 2020
What makes instance discrimination good for transfer learning?

Nanxuan Zhao, Zhirong Wu, Rynson W. H. Lau et al.

Contrastive visual pretraining based on the instance discrimination pretext task has made significant progress. Notably, recent work on unsupervised pretraining has shown to surpass the supervised counterpart for finetuning downstream applications such as object detection and segmentation. It comes as a surprise that image annotations would be better left unused for transfer learning. In this work, we investigate the following problems: What makes instance discrimination pretraining good for transfer learning? What knowledge is actually learned and transferred from these models? From this understanding of instance discrimination, how can we better exploit human annotation labels for pretraining? Our findings are threefold. First, what truly matters for the transfer is low-level and mid-level representations, not high-level representations. Second, the intra-category invariance enforced by the traditional supervised model weakens transferability by increasing task misalignment. Finally, supervised pretraining can be strengthened by following an exemplar-based approach without explicit constraints among the instances within the same category.

CVApr 14, 2020
Distilling Localization for Self-Supervised Representation Learning

Nanxuan Zhao, Zhirong Wu, Rynson W. H. Lau et al.

Recent progress in contrastive learning has revolutionized unsupervised representation learning. Concretely, multiple views (augmentations) from the same image are encouraged to map to the similar embeddings, while views from different images are pulled apart. In this paper, through visualizing and diagnosing classification errors, we observe that current contrastive models are ineffective at localizing the foreground object, limiting their ability to extract discriminative high-level features. This is due to the fact that view generation process considers pixels in an image uniformly. To address this problem, we propose a data-driven approach for learning invariance to backgrounds. It first estimates foreground saliency in images and then creates augmentations by copy-and-pasting the foreground onto a variety of backgrounds. The learning still follows the instance discrimination pretext task, so that the representation is trained to disregard background content and focus on the foreground. We study a variety of saliency estimation methods, and find that most methods lead to improvements for contrastive learning. With this approach (DiLo), significant performance is achieved for self-supervised learning on ImageNet classification, and also for object detection on PASCAL VOC and MSCOCO.