CV · Feb 3

PokeFusion Attention: Enhancing Reference-Free Style-Conditioned Generation

arXiv:2602.03220v1
AI Analysis

This work addresses a domain-specific problem for users of text-to-image diffusion models who need stable and consistent stylized character generation without external references, representing an incremental improvement over existing adapter-based approaches.

The paper tackles the problem of reference-free style-conditioned character generation in text-to-image diffusion models, where existing methods suffer from style drift and geometric inconsistency or rely on external images. The proposed PokeFusion Attention method improves style fidelity, semantic alignment, and character shape consistency on a Pokemon-style benchmark while maintaining low parameter overhead and inference-time simplicity.

This paper studies reference-free style-conditioned character generation in text-to-image diffusion models, where high-quality synthesis requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches primarily rely on text-only prompting, which is often under-specified for visual style and tends to produce noticeable style drift and geometric inconsistency, or introduce reference-based adapters that depend on external images at inference time, increasing architectural complexity and limiting deployment flexibility. We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. By decoupling text and style conditioning at the attention level, our method enables effective reference-free stylized generation while keeping the pretrained diffusion backbone fully frozen. PokeFusion Attention trains only decoder cross-attention layers together with a compact style projection module, resulting in a parameter-efficient and plug-and-play control component that can be easily integrated into existing diffusion pipelines and transferred across different backbones. Experiments on a stylized character generation benchmark (Pokemon-style) demonstrate that our method consistently improves style fidelity, semantic alignment, and character shape consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.
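The abstract does not give the attention formula, so the following is only a minimal sketch of what "decoupling text and style conditioning at the attention level" could look like. It assumes an additive form in which the same latent queries attend separately to text tokens and to style tokens, and the two results are summed. The function names, the `style_scale` weight, the parameter names, and the toy dimensions are all hypothetical and are not taken from the paper; `style_tokens` stands in for the output of the style projection module.

```python
import math
import random


def matmul(a, b):
    # Plain-Python matrix product: a is n x k, b is k x m.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def fused_cross_attention(latents, text_tokens, style_tokens, params, style_scale=1.0):
    # Assumption: text and style use separate key/value projections but share
    # the query projection, and their attention outputs are added together.
    q = matmul(latents, params["w_q"])
    text_out = attention(
        q,
        matmul(text_tokens, params["w_k_text"]),
        matmul(text_tokens, params["w_v_text"]),
    )
    style_out = attention(
        q,
        matmul(style_tokens, params["w_k_style"]),
        matmul(style_tokens, params["w_v_style"]),
    )
    return [
        [t + style_scale * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(text_out, style_out)
    ]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy usage with made-up sizes: 6 latent positions, 3 text tokens, 2 style tokens.
rng = random.Random(0)
dim = 4
params = {
    name: random_matrix(dim, dim, rng)
    for name in ("w_q", "w_k_text", "w_v_text", "w_k_style", "w_v_style")
}
latents = random_matrix(6, dim, rng)
text_tokens = random_matrix(3, dim, rng)
style_tokens = random_matrix(2, dim, rng)  # stand-in for projected learned style embeddings
out = fused_cross_attention(latents, text_tokens, style_tokens, params)
```

In this sketch, setting `style_scale` to zero reduces the layer to ordinary text cross-attention, which is one way such a component could stay plug-and-play. Whether the paper uses an additive combination, a gate, or another fusion rule is not stated in the abstract.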
