LG | May 26

Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning

arXiv:2605.27259
AI Analysis

This work provides a theoretical unification of several Transformer variants and a new self-conditioning mechanism, offering a novel perspective for researchers working on Transformer architectures and self-supervised learning.

The paper proposes Kan Extension Transformers (KETs) as a unifying categorical framework for Transformer architectures, treating a Transformer layer as a weighted structured extension operator that covers standard attention, sparse incidence mixing, and higher-order simplicial mixing, and that connects to diffusion-style completion and self-conditioning. In the strict-causal setting, quadratic KET achieves the best perplexity on WikiText-2 and WikiText-103 among the compared causal architectures, but the largest gains come from the predict-detach self-conditioning regime rather than from architectural changes.

We propose Kan Extension Transformers (KETs) as a unifying categorical framework for a diverse group of Transformer implementations. The core claim is that a Transformer layer can be viewed as a weighted structured extension operator: standard attention is the singleton-neighborhood case, Geometric Transformer-style incidence mixing is a sparse edge-restricted case, and KET is the higher-order simplicial case. This lens also clarifies a bridge to diffusion-style completion. When the extension operator acts on detached predictive carriers instead of teacher-forced hidden states, it becomes a valid self-conditioning mechanism that exposes noncausal structure without leaking gold future tokens. We include a comprehensive experimental validation of 12 Transformer implementations spanning the strict-causal and predict-detach regimes on Penn Treebank, WikiText-2, and WikiText-103. In the strict-causal setting, quadratic KET is the strongest model among the compared causal architectures on WikiText-2 and WikiText-103. Across all datasets, however, the largest gains come from the predict-detach regime rather than from changing the neighborhood family alone.
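The singleton-neighborhood claim can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function `extend`, the choice to represent each neighborhood by the mean of its members' keys and values, and the admissibility rule (every member at or before the query position) are all assumptions made here for illustration. Under those assumptions, restricting the neighborhood family to singletons reproduces ordinary causal softmax attention, and adding pairs gives one possible second-order variant.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def extend(q, k, v, neighborhoods):
    """Hypothetical weighted extension operator (illustrative only).

    Each neighborhood is a tuple of token indices. Its key and value are
    assumed to be the mean of its members' keys and values. Token i mixes
    over neighborhoods whose members all lie at or before position i.
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        cells = [c for c in neighborhoods if max(c) <= i]
        cell_keys = np.stack([k[list(c)].mean(axis=0) for c in cells])
        cell_vals = np.stack([v[list(c)].mean(axis=0) for c in cells])
        weights = softmax(cell_keys @ q[i] / np.sqrt(d))
        out[i] = weights @ cell_vals
    return out

def causal_attention(q, k, v):
    """Standard causal scaled dot-product attention, for comparison."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 5, 4
q, k, v = rng.normal(size=(3, n, d))

# Singleton family: one neighborhood per token.
singletons = [(j,) for j in range(n)]
# Pair family: an assumed stand-in for a higher-order (edge) case.
pairs = [(j, m) for j in range(n) for m in range(j + 1, n)]

first_order = extend(q, k, v, singletons)
second_order = extend(q, k, v, singletons + pairs)
```

In this sketch only the neighborhood family changes between `first_order` and `second_order`; the weighting and mixing step is shared, which is the sense in which the abstract describes attention as one case of a more general operator. The pair family grows quadratically with sequence length, so the loop above is meant for reading, not for speed.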
