On the Expressive Power of Contextual Relations in Transformers
This work contributes a theoretical account of the expressive power of transformers, which is of interest to researchers in machine learning and natural language processing. It is incremental in that it builds on existing transformer concepts.
The authors address the problem of mathematically characterizing the expressive power of transformers for contextual relationships in natural language. They introduce a measure-theoretic framework and the Sinkhorn Transformer, and prove a universal approximation theorem showing that any continuous coupling function can be uniformly approximated by this architecture.
Transformer architectures have achieved remarkable empirical success in modeling contextual relationships in natural language, yet a precise mathematical characterization of their expressive power remains incomplete. In this work, we introduce a measure-theoretic framework for contextual representations in which texts are modeled as probability measures over a semantic embedding space, and contextual relations between words are represented as coupling measures between them. Within this setting, we introduce the Sinkhorn Transformer, a transformer-like architecture. Our main result is a universal approximation theorem: any continuous coupling function between probability measures that encodes the semantic-relation coupling measure can be uniformly approximated by a Sinkhorn Transformer with appropriate parameters.
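To make the notion of a coupling measure concrete, the following is a minimal sketch of the generic Sinkhorn-Knopp iteration applied to attention-like scores between two short texts, each treated as a discrete probability measure over its word embeddings. It is an illustration only and not the Sinkhorn Transformer itself, whose layers are not specified in this abstract. The toy embeddings, the dot-product scores, the uniform word weights, and the names `sinkhorn_coupling`, `text_a`, and `text_b` are all assumptions introduced for this example.

```python
import math


def sinkhorn_coupling(scores, mu, nu, n_iters=200):
    """Rescale exp(scores) so that rows sum to mu and columns sum to nu.

    Generic Sinkhorn-Knopp iteration (an assumption for illustration,
    not the architecture defined by the authors).
    """
    n, m = len(mu), len(nu)
    # Strictly positive kernel built from the attention-like scores.
    kernel = [[math.exp(scores[i][j]) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row scaling: match the first marginal mu.
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column scaling: match the second marginal nu.
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical 2-dimensional word embeddings for two short texts.
text_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_b = [[1.0, 0.5], [0.0, 1.0]]

# Each text as a probability measure: uniform weights over its words.
mu = [1.0 / len(text_a)] * len(text_a)
nu = [1.0 / len(text_b)] * len(text_b)

# Attention-like scores between every word pair (dot products, by assumption).
scores = [[dot(x, y) for y in text_b] for x in text_a]

# The coupling: a joint distribution over word pairs with marginals mu and nu.
pi = sinkhorn_coupling(scores, mu, nu)

row_sums = [sum(row) for row in pi]
col_sums = [sum(pi[i][j] for i in range(len(mu))) for j in range(len(nu))]
```

Here `pi[i][j]` can be read as the mass relating word `i` of the first text to word `j` of the second. Unlike row-wise softmax attention, which normalizes only one marginal, the Sinkhorn scaling constrains both, which is what makes the result a coupling of `mu` and `nu`.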