53.3 | CV | May 10 | Code
Attention Sinks in Diffusion Transformers: A Causal Analysis
Fangzheng Wu, Brian Summa
Attention sinks -- tokens that receive disproportionate attention mass -- are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion 3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at $k=1$; only under stronger interventions ($k \geq 10$) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless sink-specific -- $\sim 6\times$ larger than equal-budget random masking -- revealing an empirical dissociation between trajectory-level perturbation and semantic alignment in diffusion transformers. Code is available at https://github.com/wfz666/ICML26-attention-sink.
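As a rough illustration of what "identify the dominant attention recipients and suppress them on the score and value paths" could look like, here is a minimal NumPy sketch for a single attention head. It is not the authors' implementation: the function names, the use of column-summed attention mass as the sink criterion, and the choice of masking logits to negative infinity (score path) or zeroing value rows (value path) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf logits map to exactly zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def find_sinks(scores, k=1):
    """Indices of the k key tokens that receive the most attention mass.

    scores: (num_queries, num_keys) pre-softmax attention logits.
    Assumption: "mass" is the attention weight summed over all queries.
    """
    attn = softmax(scores, axis=-1)
    mass = attn.sum(axis=0)
    return np.argsort(mass)[::-1][:k]

def suppress_score_path(scores, sinks):
    """Mask sink logits before softmax, so remaining keys renormalize."""
    masked = scores.copy()
    masked[:, sinks] = -np.inf
    return softmax(masked, axis=-1)

def suppress_value_path(scores, values, sinks):
    """Keep attention weights unchanged but zero the sinks' value vectors."""
    attn = softmax(scores, axis=-1)
    v = values.copy()
    v[sinks] = 0.0
    return attn @ v

# Toy example: key 2 is made an artificial sink by boosting its logits.
scores = np.zeros((4, 5))
scores[:, 2] = 4.0
values = np.arange(15.0).reshape(5, 3)

sinks = find_sinks(scores, k=1)
attn_masked = suppress_score_path(scores, sinks)
out_value = suppress_value_path(scores, values, sinks)
```

In a real diffusion transformer the same selection would be repeated per timestep (and per layer or head, depending on the protocol), and an equal-budget random-masking control would apply the same suppression to `k` randomly chosen tokens instead of the detected sinks.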
CV | Jan 23
Model-Centric Diagnostics: A Framework for Internal State Readouts
Fangzheng Wu, Brian Summa
We present a model-centric diagnostic framework that treats training state as a latent variable and unifies a family of internal readouts -- head-gradient norms, confidence, entropy, margin, and related signals -- as anchor-relative projections of that state. A preliminary version of this work introduced a head-gradient probe for checkpoint selection. In this version, we focus on the unifying perspective and structural diagnostics; full algorithmic details, theoretical analysis, and experimental validation will appear in a forthcoming paper. We outline the conceptual scaffold: any prediction head induces a local loss landscape whose geometry (gradient magnitude, curvature, sharpness) reflects how well the upstream features are aligned with the task. Different readout choices -- gradient norms, softmax entropy, predictive margin -- correspond to different projections of this geometry, each with complementary strengths. The framework suggests that checkpoint selection, early stopping, and lightweight architecture pre-screening can all be viewed as querying the same underlying state through different lenses. Illustrative experiments on ImageNet classification and COCO detection/segmentation hint at the practical potential; rigorous benchmarks and ablations are deferred to the full paper.
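To make the readouts named above concrete, the following sketch computes confidence, entropy, margin, and a head-gradient norm for one example under a linear softmax head. This is an illustrative assumption, not the paper's probe: the abstract defers algorithmic details, so the linear head, the cross-entropy loss, the use of a ground-truth label, and the name `readouts` are all hypothetical choices. For cross-entropy with a linear head `W`, the gradient with respect to `W` is the outer product of `(p - onehot)` and the feature vector, which is what the code uses.

```python
import numpy as np

def readouts(features, W, label):
    """Per-example readouts from a linear softmax head (illustrative only).

    features: (d,) upstream feature vector
    W:        (num_classes, d) head weights
    label:    integer class index (assumed available for the gradient readout)
    """
    logits = W @ features
    z = logits - logits.max()                # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()

    confidence = float(p.max())
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    top2 = np.sort(p)[-2:]                   # two largest probabilities
    margin = float(top2[1] - top2[0])

    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    grad_W = np.outer(p - onehot, features)  # d(cross-entropy)/dW
    head_grad_norm = float(np.linalg.norm(grad_W))

    return {
        "confidence": confidence,
        "entropy": entropy,
        "margin": margin,
        "head_grad_norm": head_grad_norm,
    }

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
r = readouts(rng.normal(size=8), rng.normal(size=(3, 8)), label=1)
```

All four quantities are functions of the same softmax output and features, which is one simple way to read the claim that different readouts are different projections of a shared underlying state.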
35.5 | CV | May 3
SteeringDiffusion: A Bottlenecked Activation Control Interface for Diffusion Models
Fangzheng Wu, Brian Summa
We introduce SteeringDiffusion, a bottlenecked activation-level control interface for diffusion models that exposes a smooth, monotonic, and runtime-adjustable control surface over the content-style trade-off. Our method keeps the U-Net backbone frozen and learns a small, prompt-conditioned latent code projected to FiLM/AdaGN-style modulation parameters. A zero-initialized design guarantees exact equivalence to the base model at zero scale, while timestep-aware gating restricts modulation to later denoising stages. A single scalar at inference continuously traverses the control surface without retraining. Across experiments on Stable Diffusion 1.5 and SDXL covering multiple artistic styles, we show that SteeringDiffusion produces smooth and monotonic content-style trade-offs. Under matched parameter budgets, it outperforms LoRA in controllability and stability, while ControlNet and rank-1 adapters do not expose a comparable control surface. We further introduce an inversion-stability diagnostic based on DDIM inversion, used as a post-hoc trajectory probe, which reveals strong correlations with intervention magnitude. These results position Steering Bottlenecked Explicit Control (S-BEC) as a practical, general-purpose control interface for frozen diffusion backbones.
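The modulation scheme described above can be sketched as a FiLM-style layer with zero-initialized projections, a timestep gate, and a single inference-time scale. This is a hedged sketch, not the SteeringDiffusion code: the class name `SteeringAdapter`, the hard threshold gate, the convention that `t` lies in [0, 1] with small `t` meaning late denoising, and the residual form `h * (1 + g * gamma) + g * beta` are all assumptions.

```python
import numpy as np

class SteeringAdapter:
    """FiLM-style modulation of a frozen feature map (illustrative only)."""

    def __init__(self, code_dim, channels, t_gate=0.5):
        # Zero-initialized projections: before training, gamma = beta = 0.
        self.W_gamma = np.zeros((channels, code_dim))
        self.W_beta = np.zeros((channels, code_dim))
        self.t_gate = t_gate  # hypothetical threshold for "later" stages

    def gate(self, t):
        # Assumed convention: t in [0, 1], t = 1 is pure noise.
        # Modulation is active only late in denoising (small t).
        return 1.0 if t <= self.t_gate else 0.0

    def __call__(self, h, z, t, scale):
        """h: (C, H, W) activations, z: (code_dim,) latent code."""
        g = scale * self.gate(t)
        gamma = self.W_gamma @ z
        beta = self.W_beta @ z
        # Residual form: g = 0 reproduces h exactly.
        return h * (1.0 + g * gamma[:, None, None]) + g * beta[:, None, None]

# Toy usage: pretend the projections have been trained to nonzero values.
rng = np.random.default_rng(0)
adapter = SteeringAdapter(code_dim=4, channels=3)
adapter.W_gamma = rng.normal(size=(3, 4))
adapter.W_beta = rng.normal(size=(3, 4))
h = rng.normal(size=(3, 2, 2))
z = rng.normal(size=4)

base = adapter(h, z, t=0.2, scale=0.0)     # identical to h
steered = adapter(h, z, t=0.2, scale=1.0)  # modulated late-stage features
early = adapter(h, z, t=0.9, scale=1.0)    # gated off, identical to h
```

Because the scale multiplies both the multiplicative and additive terms, sweeping it from 0 upward moves the output continuously away from the base model, which is the kind of one-scalar control surface the abstract describes.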
72.2 | LG | Apr 6
CPT: Controllable and Editable Design Variations with Language Models
Karthik Suresh, Amine Ben Khalifa, Li Zhang et al.
Creating visually diverse, high-quality designs remains a manual, time-consuming process, limiting scalability and personalization in creative workflows. We present a system for generating editable design variations using a decoder-only language model, the Creative Pre-trained Transformer (CPT), trained to predict visual style attributes in design templates. At the core of our approach is a new representation called Creative Markup Language (CML), a compact, machine-learning-friendly format that captures canvas-level structure, page layout, and element-level details (text, images, and vector graphics), including both content and style. We fine-tune CPT on a large corpus of design templates authored by professional designers, enabling it to learn meaningful, context-aware predictions for attributes such as color schemes and font choices. The model produces semantically structured and stylistically coherent outputs, preserving internal consistency across elements. Unlike generative image models, our system yields fully editable design documents rather than pixel-only images, allowing users to iterate and personalize within a design editor. In experiments, our approach generates contextual color and font variations for existing templates and shows promise in adjusting layouts while maintaining design principles.