Length Generalization Bounds for Transformers
This addresses a foundational theoretical problem for machine learning researchers, showing inherent limitations in guaranteeing generalization for transformers, which is incremental on prior partial results.
The paper tackled the open problem of computing length generalization bounds for CRASP, a language class linked to transformers, by proving that computable bounds do not exist for CRASP with two or more layers, but providing a computable exponential bound for positive CRASP and fixed-precision transformers, with optimality shown.
Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of any length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for CRASP, a class of languages which is closely linked to transformers. A positive partial result was recently shown by Chen et al. for CRASP with only one layer and, under some restrictions, also with two layers. We provide complete answers to the above open problem. Our main result is the non-existence of computable length generalization bounds for CRASP (already with two layers) and hence for transformers. To complement this, we provide a computable bound for the positive fragment of CRASP, which we show equivalent to fixed-precision transformers. For both positive CRASP and fixed-precision transformers, we show that the length complexity is exponential, and prove optimality of the bounds.