Tong-Yee Lee

CV
h-index36
26papers
568citations
Novelty56%
AI Score59

26 Papers

CVMar 20, 2023Code
AnimeDiffusion: Anime Face Line Drawing Colorization via Diffusion Models

Yu Cao, Xiangqiao Meng, P. Y. Mok et al.

It is a time-consuming and tedious work for manually colorizing anime line drawing images, which is an essential stage in cartoon animation creation pipeline. Reference-based line drawing colorization is a challenging task that relies on the precise cross-domain long-range dependency modelling between the line drawing and reference image. Existing learning methods still utilize generative adversarial networks (GANs) as one key module of their model architecture. In this paper, we propose a novel method called AnimeDiffusion using diffusion models that performs anime face line drawing colorization automatically. To the best of our knowledge, this is the first diffusion model tailored for anime content creation. In order to solve the huge training consumption problem of diffusion models, we design a hybrid training strategy, first pre-training a diffusion model with classifier-free guidance and then fine-tuning it with image reconstruction guidance. We find that with a few iterations of fine-tuning, the model shows wonderful colorization performance, as illustrated in Fig. 1. For training AnimeDiffusion, we conduct an anime face line drawing colorization benchmark dataset, which contains 31696 training data and 579 testing data. We hope this dataset can fill the gap of no available high resolution anime face dataset for colorization method evaluation. Through multiple quantitative metrics evaluated on our dataset and a user study, we demonstrate AnimeDiffusion outperforms state-of-the-art GANs-based models for anime face line drawing colorization. We also collaborate with professional artists to test and apply our AnimeDiffusion for their creation work. We release our code on https://github.com/xq-meng/AnimeDiffusion.

CVMay 19, 2022Code
Domain Enhanced Arbitrary Image Style Transfer via Contrastive Learning

Yuxin Zhang, Fan Tang, Weiming Dong et al.

In this work, we tackle the challenging problem of arbitrary image style transfer using a novel style feature representation learning method. A suitable style representation, as a key component in image stylization tasks, is essential to achieve satisfactory results. Existing deep neural network based approaches achieve reasonable results with the guidance from second-order statistics such as Gram matrix of content features. However, they do not leverage sufficient style information, which results in artifacts such as local distortions and style inconsistency. To address these issues, we propose to learn style representation directly from image features instead of their second-order statistics, by analyzing the similarities and differences between multiple styles and considering the style distribution. Specifically, we present Contrastive Arbitrary Style Transfer (CAST), which is a new style representation learning and style transfer method via contrastive learning. Our framework consists of three key components, i.e., a multi-layer style projector for style code encoding, a domain enhancement module for effective learning of style distribution, and a generative network for image style transfer. We conduct qualitative and quantitative evaluations comprehensively to demonstrate that our approach achieves significantly better results compared to those obtained via state-of-the-art methods. Code and models are available at https://github.com/zyxElsa/CAST_pytorch

CVFeb 23, 2023Code
Region-Aware Diffusion for Zero-shot Text-driven Image Editing

Nisha Huang, Fan Tang, Weiming Dong et al.

Image manipulation under the guidance of textual descriptions has recently received a broad range of attention. In this study, we focus on the regional editing of images with the guidance of given text prompts. Different from current mask-based image editing methods, we propose a novel region-aware diffusion model (RDM) for entity-level image editing, which could automatically locate the region of interest and replace it following given text prompts. To strike a balance between image fidelity and inference speed, we design the intensive diffusion pipeline by combing latent space diffusion and enhanced directional guidance. In addition, to preserve image content in non-edited regions, we introduce regional-aware entity editing to modify the region of interest and preserve the out-of-interest region. We validate the proposed RDM beyond the baseline methods through extensive qualitative and quantitative experiments. The results show that RDM outperforms the previous approaches in terms of visual quality, overall harmonization, non-editing region content preservation, and text-image semantic consistency. The codes are available at https://github.com/haha-lisa/RDM-Region-Aware-Diffusion-Model.

CVMar 9, 2023
A Unified Arbitrary Style Transfer Framework via Adaptive Contrastive Learning

Yuxin Zhang, Fan Tang, Weiming Dong et al.

We present Unified Contrastive Arbitrary Style Transfer (UCAST), a novel style representation learning and transfer framework, which can fit in most existing arbitrary image style transfer models, e.g., CNN-based, ViT-based, and flow-based methods. As the key component in image style transfer tasks, a suitable style representation is essential to achieve satisfactory results. Existing approaches based on deep neural network typically use second-order statistics to generate the output. However, these hand-crafted features computed from a single image cannot leverage style information sufficiently, which leads to artifacts such as local distortions and style inconsistency. To address these issues, we propose to learn style representation directly from a large amount of images based on contrastive learning, by taking the relationships between specific styles and the holistic style distribution into account. Specifically, we present an adaptive contrastive learning scheme for style transfer by introducing an input-dependent temperature. Our framework consists of three key components, i.e., a parallel contrastive learning scheme for style representation and style transfer, a domain enhancement module for effective learning of style distribution, and a generative network for style transfer. We carry out qualitative and quantitative evaluations to show that our approach produces superior results than those obtained via state-of-the-art methods.

CVNov 17, 2023
Image Collage on Arbitrary Shape via Shape-Aware Slicing and Optimization

Dong-Yi Wu, Thi-Ngoc-Hanh Le, Sheng-Yi Yao et al.

Image collage is a very useful tool for visualizing an image collection. Most of the existing methods and commercial applications for generating image collages are designed on simple shapes, such as rectangular and circular layouts. This greatly limits the use of image collages in some artistic and creative settings. Although there are some methods that can generate irregularly-shaped image collages, they often suffer from severe image overlapping and excessive blank space. This prevents such methods from being effective information communication tools. In this paper, we present a shape slicing algorithm and an optimization scheme that can create image collages of arbitrary shapes in an informative and visually pleasing manner given an input shape and an image collection. To overcome the challenge of irregular shapes, we propose a novel algorithm, called Shape-Aware Slicing, which partitions the input shape into cells based on medial axis and binary slicing tree. Shape-Aware Slicing, which is designed specifically for irregular shapes, takes human perception and shape structure into account to generate visually pleasing partitions. Then, the layout is optimized by analyzing input images with the goal of maximizing the total salient regions of the images. To evaluate our method, we conduct extensive experiments and compare our results against previous work. The evaluations show that our proposed algorithm can efficiently arrange image collections on irregular shapes and create visually superior results than prior work and existing commercial tools.

90.0GRMay 15Code
DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer

Nisha Huang, Yizhou Lin, Jie Guo et al.

Recently, diffusion-based material transfer methods rely on image fine-tuning or complex architectures with auxiliary networks but face challenges such as text dependency, additional computational costs, and feature misalignment. To address these limitations, we propose \textbf{DealMaTe}, using \underline{\textbf{de}}pth, norm\underline{\textbf{a}}l, and \underline{\textbf{l}}ighting images for \underline{\textbf{ma}}terial \underline{\textbf{t}}ransf\underline{\textbf{e}}r. DealMaTe is a simplified diffusion framework that eliminates text guidance and reference networks. We design a lightweight 3D information injection method, Multi-Dim 3D Shader LoRA, which, without modifying the base model weights, enables compatible control conditions and achieves harmonious and stable results. Additionally, we optimize the attention mechanism with Shader Causal Mutual Attention and key-value (KV) caching to reduce inference latency caused by multiple conditions, improve computational efficiency, and achieve high-quality material transfer results with low architectural complexity. Extensive experiments covering a wide variety of objects and lighting conditions consistently demonstrate that DealMaTe achieves remarkable high-fidelity material transfer under arbitrary input materials. The code is available at https://github.com/haha-lisa/DealMaTe.

CVNov 13, 2023
Regenerating Arbitrary Video Sequences with Distillation Path-Finding

Thi-Ngoc-Hanh Le, Sheng-Yi Yao, Chun-Te Wu et al.

If the video has long been mentioned as a widespread visualization form, the animation sequence in the video is mentioned as storytelling for people. Producing an animation requires intensive human labor from skilled professional artists to obtain plausible animation in both content and motion direction, incredibly for animations with complex content, multiple moving objects, and dense movement. This paper presents an interactive framework to generate new sequences according to the users' preference on the starting frame. The critical contrast of our approach versus prior work and existing commercial applications is that novel sequences with arbitrary starting frame are produced by our system with a consistent degree in both content and motion direction. To achieve this effectively, we first learn the feature correlation on the frameset of the given video through a proposed network called RSFNet. Then, we develop a novel path-finding algorithm, SDPF, which formulates the knowledge of motion directions of the source video to estimate the smooth and plausible sequences. The extensive experiments show that our framework can produce new animations on the cartoon and natural scenes and advance prior works and commercial applications to enable users to obtain more predictable results.

70.9CVApr 18
Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation

Minyan Luo, Yuxin Zhang, Yifei Li et al.

Narrative-driven product photography has become a prevalent paradigm in modern marketing, as coherent visual storytelling helps convey product value and establishes emotional engagement with consumers. However, existing image generation methods do not support structured narrative planning or cross-panel coordination, often resulting in weak storytelling and visual incoherence. In practice, narrative product photography is commonly presented as multi-grid collages, where multiple views or scenes jointly communicate a product narrative. To ensure visual consistency across grids and aesthetic harmony of the overall composition, we generate the collage as a single unified image rather than composing independently synthesized panels. We propose a self-reasoning agentic framework for narrative product grid collage generation. Given a product packshot and its name, the system first constructs a Product Narrative Framework that explicitly represents the product's identity, usage context, and situational environment, and translates it into complementary grids governed by a shared visual style. Constraint-aware prompts are then compiled and fed to a generation model that synthesizes the collage jointly. The generated output is evaluated on both content validity and photography quality, with explicit gates determining whether to proceed or refine. When evaluation fails, the system performs failure attribution and applies targeted refinement, enabling progressive improvement through iterative self-reflection. Experiments demonstrate that our framework consistently improves aesthetic quality, narrative richness, and visual coherence, compared to direct prompting baselines.

CVMar 25, 2024Code
Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Ziyao Huang, Fan Tang, Yong Zhang et al.

Despite the remarkable process of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system necessitating only a one-minute video clip of an individual for training, subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrary long temporal video, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: \url{https://github.com/ICTMCG/Make-Your-Anchor}.

99.4CVMar 26
TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts

Yu Xu, Hongbin Yan, Juan Cao et al.

Unified image generation and editing models suffer from severe task interference in dense diffusion transformers architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing v.s. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference. In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task's high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.

77.3CVMay 15
MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

Nisha Huang, Henglin Liu, Yizhou Lin et al.

Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.

48.8ROMar 16
GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions

Haifeng Zhong, Wenshuo Han, Zhouyu Wang et al.

Achieving accurate garment grasping under dynamically changing illumination is crucial for all-day operation of service robots.However, the reduced illumination in low-light scenes severely degrades garment structural features, leading to a significant drop in grasping robustness.Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model's adaptability to illumination changes when utilizing multimodal information.To address this problem, we propose GraspALL, an illumination-structure interactive compensation model.The innovation of GraspALL lies in encoding continuous illumination changes into quantitative references to guide adaptive feature fusion between RGB and non-RGB modalities according to varying lighting intensities, thereby generating illumination-consistent grasping representations.Experiments on the self-built garment grasping dataset demonstrate that GraspALL improves grasping accuracy by 32-44% over baselines under diverse illumination conditions.

CVDec 29, 2025
Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment

Henglin Liu, Nisha Huang, Chang Liu et al.

The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing dataset overly focuses on visual perception and neglects deeper dimensions due to the expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoder, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.

CVNov 8, 2023
Retargeting video with an end-to-end framework

Thi-Ngoc-Hanh Le, HuiGuang Huang, Yi-Ru Chen et al.

Video holds significance in computer graphics applications. Because of the heterogeneous of digital devices, retargeting videos becomes an essential function to enhance user viewing experience in such applications. In the research of video retargeting, preserving the relevant visual content in videos, avoiding flicking, and processing time are the vital challenges. Extending image retargeting techniques to the video domain is challenging due to the high running time. Prior work of video retargeting mainly utilizes time-consuming preprocessing to analyze frames. Plus, being tolerant of different video content, avoiding important objects from shrinking, and the ability to play with arbitrary ratios are the limitations that need to be resolved in these systems requiring investigation. In this paper, we present an end-to-end RETVI method to retarget videos to arbitrary aspect ratios. We eliminate the computational bottleneck in the conventional approaches by designing RETVI with two modules, content feature analyzer (CFA) and adaptive deforming estimator (ADE). The extensive experiments and evaluations show that our system outperforms previous work in quality and running time. Visit our project website for more results at http://graphics.csie.ncku.edu.tw/RETVI.

29.6CVMay 12
Optimizing 4D Wires for Sparse 3D Abstraction

Dong-Yi Wu, Tong-Yee Lee

We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width $(x,y,z,w)$. Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.

CVJan 25, 2024Code
CreativeSynth: Cross-Art-Attention for Artistic Image Synthesis with Multimodal Diffusion

Nisha Huang, Weiming Dong, Yuxin Zhang et al.

Although remarkable progress has been made in image style transfer, style is just one of the components of artistic paintings. Directly transferring extracted style features to natural images often results in outputs with obvious synthetic traces. This is because key painting attributes including layout, perspective, shape, and semantics often cannot be conveyed and expressed through style transfer. Large-scale pretrained text-to-image generation models have demonstrated their capability to synthesize a vast amount of high-quality images. However, even with extensive textual descriptions, it is challenging to fully express the unique visual properties and details of paintings. Moreover, generic models often disrupt the overall artistic effect when modifying specific areas, making it more complicated to achieve a unified aesthetic in artworks. Our main novel idea is to integrate multimodal semantic information as a synthesis guide into artworks, rather than transferring style to the real world. We also aim to reduce the disruption to the harmony of artworks while simplifying the guidance conditions. Specifically, we propose an innovative multi-task unified framework called CreativeSynth, based on the diffusion model with the ability to coordinate multimodal inputs. CreativeSynth combines multimodal features with customized attention mechanisms to seamlessly integrate real-world semantic content into the art domain through Cross-Art-Attention for aesthetic maintenance and semantic fusion. We demonstrate the results of our method across a wide range of different art categories, proving that CreativeSynth bridges the gap between generative models and artistic expression. Code and results are available at https://github.com/haha-lisa/CreativeSynth.

GRMay 25, 2023Code
ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models

Yuxin Zhang, Weiming Dong, Fan Tang et al.

Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this problem, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called \sysname. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer better disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout, achieving previously unattainable results from a single image input without fine-tuning the diffusion models. Our source code is available athttps://github.com/zyxElsa/ProSpect.

CVNov 22, 2024
HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads

Yu Xu, Fan Tang, Juan Cao et al.

Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures that could utilize self/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicit and consistent incorporated text guidance, resulting in semantic misalignment between the edited results and texts. In this study, we disclose the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter's performance in terms of editing fidelity and image quality.

CVMar 28, 2024
Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization

Yu Xu, Fan Tang, Juan Cao et al.

Personalized generation paradigms empower designers to customize visual intellectual properties with the help of textual descriptions by tuning or adapting pre-trained text-to-image models on a few images. Recent works explore approaches for concurrently customizing both content and detailed visual style appearance. However, these existing approaches often generate images where the content and style are entangled. In this study, we reconsider the customization of content and style concepts from the perspective of parameter space construction. Unlike existing methods that utilize a shared parameter space for content and style, we propose a learning framework that separates the parameter space to facilitate individual learning of content and style, thereby enabling disentangled content and style. To achieve this goal, we introduce "partly learnable projection" (PLP) matrices to separate the original adapters into divided sub-parameter spaces. We propose "break-for-make" customization learning pipeline based on PLP, which is simple yet effective. We break the original adapters into "up projection" and "down projection", train content and style PLPs individually with the guidance of corresponding textual prompts in the separate adapters, and maintain generalization by employing a multi-correspondence projection learning strategy. Based on the adapters broken apart for separate training content and style, we then make the entity parameter space by reconstructing the content and style PLPs matrices, followed by fine-tuning the combined adapter to generate the target object with the desired appearance. Experiments on various styles, including textures, materials, and artistic style, show that our method outperforms state-of-the-art single/multiple concept learning pipelines in terms of content-style-prompt alignment.

CVJan 3, 2025
ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing

Nisha Huang, Kaer Huang, Yifan Pu et al.

Recent years have witnessed significant advancements in text-guided style transfer, primarily attributed to innovations in diffusion models. These models excel in conditional guidance, utilizing text or images to direct the sampling process. However, despite their capabilities, direct conditional guidance approaches often face challenges in balancing the expressiveness of textual semantics with the diversity of output results while capturing stylistic features. To address these challenges, we introduce ArtCrafter, a novel framework for text-to-image style transfer. Specifically, we introduce an attention-based style extraction module, meticulously engineered to capture the subtle stylistic elements within an image. This module features a multi-layer architecture that leverages the capabilities of perceiver attention mechanisms to integrate fine-grained information. Additionally, we present a novel text-image aligning augmentation component that adeptly balances control over both modalities, enabling the model to efficiently map image and text embeddings into a shared feature space. We achieve this through attention operations that enable smooth information flow between modalities. Lastly, we incorporate an explicit modulation that seamlessly blends multimodal enhanced embeddings with original embeddings through an embedding reframing design, empowering the model to generate diverse outputs. Extensive experiments demonstrate that ArtCrafter yields impressive results in visual stylization, exhibiting exceptional levels of stylistic intensity, controllability, and diversity.

CVMay 24, 2024
GSDeformer: Direct, Real-time and Extensible Cage-based Deformation for 3D Gaussian Splatting

Jiajun Huang, Shuolin Xu, Hongchuan Yu et al.

We present GSDeformer, a method that enables cage-based deformation on 3D Gaussian Splatting (3DGS). Our approach bridges cage-based deformation and 3DGS by using a proxy point-cloud representation. This point cloud is generated from 3D Gaussians, and deformations applied to the point cloud are translated into transformations on the 3D Gaussians. To handle potential bending caused by deformation, we incorporate a splitting process to approximate it. Our method does not modify or extend the core architecture of 3D Gaussian Splatting, making it compatible with any trained vanilla 3DGS or its variants. Additionally, we automate cage construction for 3DGS and its variants using a render-and-reconstruct approach. Experiments demonstrate that GSDeformer delivers superior deformation results compared to existing methods, is robust under extreme deformations, requires no retraining for editing, runs in real-time, and can be extended to other 3DGS variants. Project Page: https://jhuangbu.github.io/gsdeformer/

CVFeb 1
Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Yu Xu, Yuxin Zhang, Juan Cao et al.

A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the "creative essence" from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar ("G"). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.

CVJan 26, 2025
IP-Prompter: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting

Yuxin Zhang, Minyan Luo, Weiming Dong et al.

The stories and characters that captivate us as we grow up shape unique fantasy worlds, with images serving as the primary medium for visually experiencing these realms. Personalizing generative models through fine-tuning with theme-specific data has become a prevalent approach in text-to-image generation. However, unlike object customization, which focuses on learning specific objects, theme-specific generation encompasses diverse elements such as characters, scenes, and objects. Such diversity also introduces a key challenge: how to adaptively generate multi-character, multi-concept, and continuous theme-specific images (TSI). Moreover, fine-tuning approaches often come with significant computational overhead, time costs, and risks of overfitting. This paper explores a fundamental question: Can image generation models directly leverage images as contextual input, similarly to how large language models use text as context? To address this, we present IP-Prompter, a novel training-free TSI generation method. IP-Prompter introduces visual prompting, a mechanism that integrates reference images into generative models, allowing users to seamlessly specify the target theme without requiring additional training. To further enhance this process, we propose a Dynamic Visual Prompting (DVP) mechanism, which iteratively optimizes visual prompts to improve the accuracy and quality of generated images. Our approach enables diverse applications, including consistent story generation, character design, realistic character generation, and style-guided image generation. Comparative evaluations against state-of-the-art personalization methods demonstrate that IP-Prompter achieves significantly better results and excels in maintaining character identity preserving, style consistency and text alignment, offering a robust and flexible solution for theme-specific image generation.

CVNov 23, 2024
Interactive Visual Assessment for Text-to-Image Generation Models

Xiaoyue Mi, Fan Tang, Juan Cao et al.

Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These fashions suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers of test inputs and reflects model potential failure patterns supporting in-depth analysis using the logical reasoning ability of LLM. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify max up to 2.56 times generation failures than conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.

CVFeb 12, 2018
Image Retargetability

Fan Tang, Weiming Dong, Yiping Meng et al.

Real-world applications could benefit from the ability to automatically retarget an image to different aspect ratios and resolutions, while preserving its visually and semantically important content. However, not all images can be equally well processed that way. In this work, we introduce the notion of image retargetability to describe how well a particular image can be handled by content-aware image retargeting. We propose to learn a deep convolutional neural network to rank photo retargetability in which the relative ranking of photo retargetability is directly modeled in the loss function. Our model incorporates joint learning of meaningful photographic attributes and image content information which can help regularize the complicated retargetability rating problem. To train and analyze this model, we have collected a database which contains retargetability scores and meaningful image attributes assigned by six expert raters. Experiments demonstrate that our unified model can generate retargetability rankings that are highly consistent with human labels. To further validate our model, we show applications of image retargetability in retargeting method selection, retargeting method assessment and photo collage generation.

GRMar 26, 2014
Image Retargeting by Content-Aware Synthesis

Weiming Dong, Fuzhang Wu, Yan Kong et al.

Real-world images usually contain vivid contents and rich textural details, which will complicate the manipulation on them. In this paper, we design a new framework based on content-aware synthesis to enhance content-aware image retargeting. By detecting the textural regions in an image, the textural image content can be synthesized rather than simply distorted or cropped. This method enables the manipulation of textural & non-textural regions with different strategy since they have different natures. We propose to retarget the textural regions by content-aware synthesis and non-textural regions by fast multi-operators. To achieve practical retargeting applications for general images, we develop an automatic and fast texture detection method that can detect multiple disjoint textural regions. We adjust the saliency of the image according to the features of the textural regions. To validate the proposed method, comparisons with state-of-the-art image targeting techniques and a user study were conducted. Convincing visual results are shown to demonstrate the effectiveness of the proposed method.