Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior
This work targets more efficient and accurate generative models in machine learning. Its contribution appears incremental, since it builds on existing transformer and optimal transport frameworks.
The authors design sparse transformer architectures that incorporate prior information about the data distribution through a regularized Wasserstein proximal operator with an $L_1$ prior. Relative to classical flow-based models, this improves the convexity of the optimization problem and promotes sparsity in the generated samples. They report higher accuracy and faster convergence to the target distribution than classical neural ODE-based methods, in both generative modeling and Bayesian inverse problems.
In this work, we propose a sparse transformer architecture that incorporates prior information about the underlying data distribution directly into the transformer structure of the neural network. The design of the model is motivated by a special optimal transport problem, namely the regularized Wasserstein proximal operator, which admits a closed-form solution and turns out to be a special representation of transformer architectures. Compared with classical flow-based models, the proposed approach improves the convexity properties of the optimization problem and promotes sparsity in the generated samples. Through both theoretical analysis and numerical experiments, including applications in generative modeling and Bayesian inverse problems, we demonstrate that the sparse transformer achieves higher accuracy and faster convergence to the target distribution than classical neural ODE-based methods.
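The abstract does not state how the $L_1$ prior produces sparsity. As a generic illustration only, and not the authors' architecture, one standard mechanism is the proximal operator of $\lambda \|x\|_1$, which is elementwise soft-thresholding: entries with magnitude at most $\lambda$ are set exactly to zero and the rest are shrunk toward zero by $\lambda$. The function name `soft_threshold` and the parameter `lam` below are hypothetical names chosen for this sketch.

```python
def soft_threshold(x, lam):
    """Elementwise proximal operator of lam * ||x||_1 (soft-thresholding).

    Illustrative sketch only: this is the textbook L1 proximal map,
    assumed here as an example of how an L1 prior induces sparsity.
    It is not taken from the paper.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    out = []
    for v in x:
        if abs(v) <= lam:
            # Small entries are zeroed out exactly: this is the source of sparsity.
            out.append(0.0)
        elif v > 0:
            # Large positive entries are shrunk toward zero by lam.
            out.append(v - lam)
        else:
            # Large negative entries are shrunk toward zero by lam.
            out.append(v + lam)
    return out


if __name__ == "__main__":
    sample = [3.0, -0.5, 0.2, -2.0]
    print(soft_threshold(sample, 1.0))  # [2.0, 0.0, 0.0, -1.0]
```

In this toy input, two of the four coordinates become exactly zero, which is the qualitative behavior meant by "promotes sparsity in the generated samples".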