LG CLMay 30, 2023

Universality and Limitations of Prompt Tuning

Yihan Wang, Jatin Chauhan, Wei Wang, Cho-Jui Hsieh

arXiv:2305.18787v222.336 citations

Originality Incremental advance

AI Analysis

This work addresses the theoretical understanding of prompt tuning versus weight tuning for researchers in machine learning, though it is incremental as it builds on existing empirical studies.

The paper theoretically analyzes prompt tuning for transformers, proving that a sufficiently strong transformer with a prompt can approximate any Lipschitz sequence-to-sequence function, but also showing limitations where finite-depth transformers cannot memorize certain datasets or require many prompt parameters.

Despite the demonstrated empirical efficacy of prompt tuning to adapt a pretrained language model for a new task, the theoretical underpinnings of the difference between "tuning parameters before the input" against "the tuning of model weights" are limited. We thus take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures. By considering a general purpose architecture, we analyze prompt tuning from the lens of both: universal approximation and limitations with finite-depth fixed-weight pretrained transformers for continuous-valued functions. Our universality result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions. The limitations of prompt tuning for limited-depth transformers are first proved by constructing a set of datasets, that cannot be memorized by a prompt of any length for a given single encoder layer. We also provide a lower bound on the required number of tunable prompt parameters and compare the result with the number of parameters required for a low-rank update (based on LoRA) for a single-layer setting. We finally extend our analysis to multi-layer settings by providing sufficient conditions under which the transformer can at best learn datasets from invertible functions only. Our theoretical claims are also corroborated by empirical results.

View on arXiv PDF

Similar