Consistent Accelerated Inference via Confident Adaptive Transformers
This work addresses the computational inefficiency of large Transformers in NLP, though it appears incremental: it builds on existing acceleration methods and adds consistency guarantees.
The paper tackles the unpredictable performance costs of accelerated inference for large Transformers. It introduces Confident Adaptive Transformers (CATs), which stop computation early on a per-input basis using a meta consistency classifier whose stopping rule is calibrated with an extension of conformal prediction. The result is increased efficiency together with a specifiable degree of consistency with the original model, guaranteed with high confidence.
We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs -- Confident Adaptive Transformers -- in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.
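To make the stopping rule concrete, the following is a minimal sketch of threshold-based early exiting with a conformal-style calibration step. It is an illustration under stated assumptions, not the exact procedure of the paper: the function names `calibrate_threshold` and `early_exit_layer`, the data layout, and the choice of nonconformity score (the largest meta score assigned to any inconsistent early layer) are all hypothetical. It assumes the meta consistency classifier has already produced one score per early layer, and that calibration and test examples are exchangeable.

```python
import math

def calibrate_threshold(cal_scores, cal_consistent, epsilon):
    """Pick an exit threshold from held-out calibration data (illustrative).

    cal_scores[i][k]     : meta classifier score for example i at early layer k
    cal_consistent[i][k] : True if layer k's prediction matches the final layer
    epsilon              : tolerated rate of inconsistent early exits
    """
    worst = []
    for scores, ok in zip(cal_scores, cal_consistent):
        # Highest score the meta classifier gave to an inconsistent layer.
        bad = [s for s, c in zip(scores, ok) if not c]
        worst.append(max(bad) if bad else float("-inf"))
    worst.sort()
    n = len(worst)
    # Standard split-conformal rank: ceil((n + 1) * (1 - epsilon)).
    rank = math.ceil((n + 1) * (1 - epsilon))
    if rank > n:
        # Too few calibration points for this epsilon: never exit early.
        return float("inf")
    return worst[rank - 1]

def early_exit_layer(scores, threshold):
    """Return the index of the first early layer whose score exceeds the
    threshold, or len(scores) to signal that the final layer is used."""
    for k, s in enumerate(scores):
        if s > threshold:
            return k
    return len(scores)

# Toy calibration set: four examples, two early layers each (made-up numbers).
cal_scores = [[0.2, 0.9], [0.6, 0.8], [0.3, 0.7], [0.5, 0.95]]
cal_consistent = [[False, True], [False, True], [True, True], [False, False]]
threshold = calibrate_threshold(cal_scores, cal_consistent, epsilon=0.5)
exit_at = early_exit_layer([0.4, 0.7, 0.9], threshold)
```

Under the exchangeability assumption, a new input's highest inconsistent score falls at or below the calibrated threshold with probability at least 1 - epsilon, so any layer that strictly exceeds the threshold is consistent with the final layer at that rate. Falling through to the final layer is consistent by definition.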