ML · LG · Apr 11, 2025

Transformer Learns Optimal Variable Selection in Group-Sparse Classification

arXiv:2504.08638v1 · 6 citations · h-index: 5 · ICLR
Originality: Incremental advance
AI Analysis

The paper provides theoretical insight into how transformers perform variable selection in group-sparse classification, an incremental advance over existing empirical successes.

The paper tackles the problem of understanding how transformers can learn structured data with group sparsity, showing theoretically that a one-layer transformer trained by gradient descent can correctly select relevant variables for classification and adapt to new tasks with good accuracy using limited samples.

Transformers have demonstrated remarkable success across various applications. However, this success has not been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups and the label depends only on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.
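To make the group-sparse setting concrete, here is a minimal NumPy sketch. It is not the paper's construction or training procedure. The Gaussian features, the sign-of-linear label rule, the hand-set attention scores, and every name in it (`make_group_sparse_data`, `attention_pool`, `relevant_group`) are assumptions made for illustration. It only shows why softmax weights concentrated on the one relevant group help a linear classifier, compared with weights spread uniformly over all groups.

```python
import numpy as np

def make_group_sparse_data(n, num_groups, group_dim, relevant_group, rng):
    """Draw n samples whose label depends on a single group of variables."""
    # X has shape (n, num_groups, group_dim): one feature vector per group.
    X = rng.standard_normal((n, num_groups, group_dim))
    # Hypothetical ground-truth direction; it acts on the relevant group only.
    v = rng.standard_normal(group_dim)
    v /= np.linalg.norm(v)
    y = np.sign(X[:, relevant_group, :] @ v)
    y[y == 0] = 1.0  # break exact ties, which have probability zero here
    return X, y, v

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def attention_pool(X, scores):
    """Combine the groups using softmax weights computed from per-group scores."""
    w = softmax(scores)
    return np.einsum("g,ngd->nd", w, X), w

rng = np.random.default_rng(0)
num_groups, group_dim, relevant_group = 5, 8, 2
X, y, v = make_group_sparse_data(2000, num_groups, group_dim, relevant_group, rng)

# Hand-set scores (assumption): "sharp" favours the relevant group,
# "uniform" treats every group alike.
sharp_scores = np.zeros(num_groups)
sharp_scores[relevant_group] = 10.0
uniform_scores = np.zeros(num_groups)

pooled_sharp, w_sharp = attention_pool(X, sharp_scores)
pooled_uniform, w_uniform = attention_pool(X, uniform_scores)

# Classify the pooled features with the true direction v.
acc_sharp = np.mean(np.sign(pooled_sharp @ v) == y)
acc_uniform = np.mean(np.sign(pooled_uniform @ v) == y)
print(f"sharp attention accuracy:   {acc_sharp:.3f}")
print(f"uniform attention accuracy: {acc_uniform:.3f}")
```

With sharp scores the pooled vector is almost exactly the relevant group's features, so the irrelevant groups contribute little noise. With uniform scores the irrelevant groups dilute the signal. In the paper, the attention weights are learned by gradient descent rather than set by hand as they are here.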

