ML | LG | Dec 10, 2025

Transformers for Tabular Data: A Training Perspective of Self-Attention via Optimal Transport

arXiv:2512.09530v1 | h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses training inefficiencies in Transformers for tabular classification, offering a more computationally efficient alternative, though it is incremental as it builds on existing OT and MLP methods.

The study analyzes the inefficiency of self-attention training on tabular data through Optimal Transport (OT). It proposes an OT-based algorithm that generates class-specific dummy distributions, computes an OT alignment between them and the data, and trains an MLP to generalize that alignment. The method achieves accuracy comparable to Transformers while reducing computational cost and scaling more efficiently under standardized inputs.

This thesis examines self-attention training through the lens of Optimal Transport (OT) and develops an OT-based alternative for tabular classification. The study tracks intermediate projections of the self-attention layer during training and evaluates their evolution using discrete OT metrics, including Wasserstein distance, Monge gap, optimality, and efficiency. Experiments are conducted on classification tasks with two and three classes, as well as on a biomedical dataset. Results indicate that the final self-attention mapping often approximates the OT-optimal coupling, yet the training trajectory remains inefficient. Pretraining the MLP section on synthetic data partially improves convergence but is sensitive to initialization. To address these limitations, an OT-based algorithm is introduced: it generates class-specific dummy Gaussian distributions, computes an OT alignment with the data, and trains an MLP to generalize this mapping. The method achieves accuracy comparable to Transformers while reducing computational cost and scaling more efficiently under standardized inputs, though its performance depends on careful dummy-geometry design. All experiments and implementations are conducted in R.
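
The alignment step described in the abstract can be illustrated with a small sketch. This is not the thesis code (which is written in R and is not reproduced here); it is a toy Python illustration under stated assumptions: each class gets an isotropic Gaussian "dummy" cloud with the same number of points as the class data, weights are uniform, and the cost is squared Euclidean distance. Under those assumptions an optimal coupling is a permutation, so the sketch finds it by brute force. All function names, the `centers` argument, and the `scale` parameter are hypothetical.

```python
import itertools
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def sample_dummy_gaussian(center, n, scale, rng):
    """Draw n points from an isotropic Gaussian around `center`.

    The center and scale stand in for the 'dummy geometry'; they are
    illustrative assumptions, not values from the thesis.
    """
    return [tuple(rng.gauss(c, scale) for c in center) for _ in range(n)]


def ot_assignment(source, target):
    """Exact discrete OT between two equal-size, uniformly weighted clouds.

    With uniform weights and equal sizes an optimal coupling is a
    permutation, so we search all permutations. The cost is factorial in
    n, so this is only suitable for toy sizes.
    """
    n = len(source)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(sq_dist(source[i], target[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    w2 = math.sqrt(best_cost / n)  # empirical 2-Wasserstein distance
    return best_perm, w2


def build_regression_targets(data_by_class, centers, scale=0.1, seed=0):
    """Pair each data point with its OT-matched dummy point, class by class.

    The returned (input, target) pairs are what a regressor such as an
    MLP could then be fit on to generalize the mapping to unseen rows.
    """
    rng = random.Random(seed)
    pairs = []
    for label, points in data_by_class.items():
        dummy = sample_dummy_gaussian(centers[label], len(points), scale, rng)
        perm, _ = ot_assignment(points, dummy)
        pairs.extend((x, dummy[j]) for x, j in zip(points, perm))
    return pairs
```

For realistic sample sizes the brute-force search would be replaced by a proper assignment or OT solver; the sketch only shows how the (input, target) pairs for the MLP could be formed, and fitting the MLP itself is left out.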

