From Kernels to Attention: A Transformer Framework for Density and Score Estimation
This work provides a unified, distribution-agnostic framework for nonparametric density and score estimation; the contribution is incremental but offers practical improvements over classical approaches.
The authors address joint score and density estimation by developing a permutation- and affine-equivariant transformer that estimates both quantities directly from i.i.d. samples, achieving substantially lower error and better scaling than traditional methods such as KDE and SD-KDE.
We introduce a unified attention-based framework for joint score and density estimation. Framing the problem as a sequence-to-sequence task, we develop a permutation- and affine-equivariant transformer that estimates both the probability density $f(x)$ and its score $\nabla_x \log f(x)$ directly from i.i.d. samples. Unlike traditional score-matching methods that require training a separate model for each distribution, our approach learns a single distribution-agnostic operator that generalizes across densities and sample sizes. The architecture employs cross-attention to connect observed samples with arbitrary query points, enabling generalization beyond the training data, while built-in symmetry constraints ensure equivariance to permutations and affine transformations. Analytically, we show that the attention weights can recover classical kernel density estimation (KDE) and verify this empirically, establishing a principled link between KDE and the transformer architecture. Empirically, the model achieves substantially lower error and better error scaling than KDE and score-debiased KDE (SD-KDE), while also exhibiting better runtime scaling. Together, these results establish transformers as general-purpose, data-adaptive operators for nonparametric density and score estimation.
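To make the link between attention weights and KDE concrete, the following minimal sketch works through the standard Gaussian-kernel case in one dimension. It illustrates the general identity only and is not the paper's architecture or construction: the function names, the bandwidth `h`, and the choice of logits $-(x - x_i)^2 / (2h^2)$ are all assumptions made for the example. With those logits, the softmax normalizer gives the KDE density up to a constant, and the softmax-weighted average of the values $(x_i - x)/h^2$ gives the KDE score.

```python
import math


def attention_weights(query, samples, h):
    """Softmax weights over samples, plus the log-sum-exp of the logits.

    Logits are -(query - x_i)^2 / (2 h^2), i.e. Gaussian-kernel log-values
    without the normalizing constant (an assumption of this sketch).
    """
    logits = [-((query - x) ** 2) / (2.0 * h * h) for x in samples]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps], m + math.log(z)


def kde_via_attention(query, samples, h):
    """Gaussian KDE density and score at `query`, written as attention."""
    weights, log_sum_exp = attention_weights(query, samples, h)
    n = len(samples)
    # Density: softmax normalizer times the Gaussian constant, divided by n.
    density = math.exp(log_sum_exp) / (n * h * math.sqrt(2.0 * math.pi))
    # Score: attention output with values (x_i - query) / h^2.
    score = sum(w * (x - query) for w, x in zip(weights, samples)) / (h * h)
    return density, score


def kde_direct(query, samples, h):
    """Reference Gaussian KDE density and score, computed without softmax."""
    n = len(samples)
    norm = n * h * math.sqrt(2.0 * math.pi)
    kernels = [math.exp(-((query - x) ** 2) / (2.0 * h * h)) for x in samples]
    density = sum(kernels) / norm
    gradient = sum(k * (x - query) / (h * h) for k, x in zip(kernels, samples)) / norm
    return density, gradient / density


if __name__ == "__main__":
    samples = [-1.0, 0.0, 0.5, 2.0]
    print(kde_via_attention(0.3, samples, h=0.7))
    print(kde_direct(0.3, samples, h=0.7))
```

The two functions should agree up to floating-point error for any query point. In this fixed-kernel sketch the bandwidth is a hand-picked constant, whereas a trained transformer would presumably learn its query, key, and value maps from data.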