Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning
This addresses the problem of unstable in-context learning for language model users, offering a novel method that is incremental over prior work like Knn Prompting.
The paper tackles the instability of in-context learning in language models by proposing a clustering method based on vocabulary-defined semantics, which improves performance by 3%-49% on average and reduces computation time by half.
In-context learning enables language models (LM) to adapt to downstream data or tasks by incorporating few samples as demonstrations within the prompts. It offers strong performance without the expense of fine-tuning. However, the performance of in-context learning can be unstable depending on the quality, format, or order of demonstrations, which in turn exacerbates the difficulty of optimization. Prior work, such as Knn Prompting, index samples based on the similarities of logits at the output-side, in addition to the regular retrieval operation at the input-side. They improve in-context learning by leveraging the core ability of next-token prediction, rather than relying solely on the emergent capacity to make analogies. Despite this, the hard-to-optimize issue of in-context learning still exists. In our view, it stems from the process of selecting demonstrations. To address this, we propose complementing in-context learning with an additional clustering operation. We propose a novel approach "vocabulary-defined semantics". Grounded in LM vocabulary, which is the label space of model outputs, the proposed approach computes semantically equivalent latent representations for output labels. Then, taking the representations as centroids, a clustering operation is performed to align the semantic properties between the language model and the downstream data/tasks. Based on extensive experiments across diverse textual understanding datasets and multiple models, our approach outperforms the state-of-the-art in terms of effectiveness and efficiency. On average, it achieves $3\%-49\%$ improvements while requiring only half of the computation time.