Evaluating the Effectiveness of Black-Box Prompt Optimization as the Scale of LLMs Continues to Grow
This work addresses whether black-box prompt optimization still pays off for increasingly large LLMs. It reports an inverse scaling law that could affect practitioners relying on these methods, though the contribution is incremental in nature.
The study investigated whether black-box prompt optimization methods remain effective as large language models (LLMs) scale up, finding that they offer only limited improvements on large-scale models like DeepSeek V3 and Gemini 2.0 Flash, with effectiveness diminishing as model size increases.
Black-box prompt optimization methods have emerged as a promising strategy for refining input prompts to better align large language models (LLMs), thereby enhancing their task performance. Although these methods have demonstrated encouraging results, most studies and experiments have focused on smaller-scale models (e.g., 7B, 14B) or earlier versions of LLMs (e.g., GPT-3.5). As LLMs continue to grow, with models such as DeepSeek V3 (671B), it remains an open question whether these black-box optimization techniques still yield significant performance improvements at that scale. To answer this question, we select three well-known black-box optimization methods and evaluate them on large-scale LLMs (DeepSeek V3 and Gemini 2.0 Flash) across four NLU and NLG datasets. The results show that these methods offer only limited improvements on these large-scale LLMs. We hypothesize that model scale is the primary factor behind the limited benefits. To test this hypothesis, we conduct experiments on LLMs of varying sizes (the Qwen 2.5 series, ranging from 7B to 72B) and observe an inverse scaling law: the effectiveness of black-box optimization methods diminishes as model size increases.
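One way to make the inverse scaling claim concrete is to compare, for each model size, the task score obtained with the original prompt against the score obtained with the optimized prompt, and then check whether that gain shrinks as size grows. The sketch below is a minimal illustration under stated assumptions, not the evaluation protocol used in the study: the function names are hypothetical, the gain is taken to be a simple absolute score difference, and the numbers in the usage example are placeholders chosen for illustration, not results from the paper.

```python
from typing import Dict, Tuple


def optimization_gain(baseline: float, optimized: float) -> float:
    """Absolute score gain from the optimized prompt (assumed metric)."""
    return optimized - baseline


def gains_by_size(scores: Dict[float, Tuple[float, float]]) -> Dict[float, float]:
    """Map model size (billions of parameters) to gain, ordered by size.

    `scores` maps size -> (baseline score, score with optimized prompt).
    """
    return {
        size: optimization_gain(baseline, optimized)
        for size, (baseline, optimized) in sorted(scores.items())
    }


def is_inverse_scaling(gains: Dict[float, float]) -> bool:
    """True if the gain never rises with size and is strictly lower
    for the largest model than for the smallest one."""
    values = [gains[size] for size in sorted(gains)]
    if len(values) < 2:
        return False  # a trend needs at least two model sizes
    non_increasing = all(a >= b for a, b in zip(values, values[1:]))
    return non_increasing and values[0] > values[-1]


# Placeholder numbers for illustration only; NOT results from the paper.
placeholder_scores = {
    7: (60.0, 66.0),
    14: (70.0, 74.0),
    32: (78.0, 80.0),
    72: (84.0, 85.0),
}

gains = gains_by_size(placeholder_scores)
print(gains)
print(is_inverse_scaling(gains))
```

In practice, a check like this would be run per dataset and per optimization method, since a trend that holds on average may not hold for every task.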