CL IR | Jul 1, 2025

Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

Mohna Chakraborty, Adithya Kulkarni, Qi Li

arXiv:2507.00330v1 | 2.7 | h-index: 5 | PAKDD

Originality: Incremental advance

AI Analysis

This paper addresses the problem of improving few-shot learning in NLP for cold-start settings, but the contribution is incremental because it builds on existing prompt-based methods.

The paper tackles the sensitivity of prompt-based methods to template, verbalizer, and instance selection in cold-start scenarios, where no labeled data is available. It proposes COLDSELECT, a joint verbalizer and instance selection approach that models data diversity, and reports that it outperforms baselines on eight benchmarks.

Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.
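The abstract describes the selection pipeline only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It covers only the instance side: reduce the embeddings, cluster them for diversity, and keep the least uncertain instance in each cluster. Everything here is an assumption made for illustration: the function names, PCA as the dimensionality reduction, k-means as the clustering, entropy as the uncertainty measure, and random arrays standing in for real $h_{[MASK]}$ embeddings and label probabilities.

```python
import numpy as np

def pca_reduce(x, n_components):
    """Project rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(x, k, n_iter=50, seed=0):
    """Plain k-means; returns a cluster label for every row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def entropy(probs):
    """Row-wise Shannon entropy, used here as the uncertainty score."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_instances(mask_embeddings, label_probs, n_clusters, n_components=2):
    """Pick one low-uncertainty instance per cluster (hypothetical helper).

    Clustering spreads the picks across the embedding space (diversity);
    the argmin over entropy favors confident instances (low uncertainty).
    """
    reduced = pca_reduce(mask_embeddings, n_components)
    labels = kmeans(reduced, n_clusters)
    uncertainty = entropy(label_probs)
    selected = []
    for j in range(n_clusters):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        selected.append(int(members[uncertainty[members].argmin()]))
    return selected

# Synthetic stand-ins: 60 instances, 16-dim embeddings, 3 classes.
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 16))
logits = rng.normal(size=(60, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
picked = select_instances(emb, probs, n_clusters=4)
print(picked)
```

In a real setting the embeddings and probabilities would come from a masked language model applied to templated inputs, and the verbalizer tokens would be selected jointly in the same space, which this sketch does not attempt.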
