Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
This work addresses the challenge of prompt quality and security for LLM users, though its contribution is incremental because it builds on existing query refinement techniques.
The study tackles the problem that low-quality and harmful user prompts limit large language models (LLMs). It proposes a reinforcement learning-driven query refinement framework, and its experiments show improved response quality and greater robustness against jailbreak attacks.
The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses relies heavily on the quality of user prompts. However, these prompts are often brief and vague, which limits the full potential of LLMs. Moreover, adversaries can meticulously craft and manipulate harmful prompts to jailbreak LLMs, inducing them to produce potentially toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, enabling LLMs to generate more truthful, benign, and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: https://github.com/Huangzisu/query-refinement .
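
The refine-then-respond arrangement described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `RefinedPipeline`, `combined_reward`, `toy_refiner`, and `toy_responder` are hypothetical, the two models are stand-in functions, and the weighted sum is only one assumed way of combining multiple objectives into a single reward. The abstract does not specify the actual reward design.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any callable that maps text to text.
Refiner = Callable[[str], str]
Responder = Callable[[str], str]


@dataclass
class RefinedPipeline:
    """Places a pluggable query refiner in front of an unchanged response model."""
    refiner: Refiner
    responder: Responder

    def __call__(self, prompt: str) -> str:
        refined = self.refiner(prompt)   # rewrite the user prompt first
        return self.responder(refined)   # then query the response model


def combined_reward(quality: float, safety: float,
                    w_quality: float = 0.5, w_safety: float = 0.5) -> float:
    """Assumed multi-objective reward: a weighted sum of two scores."""
    return w_quality * quality + w_safety * safety


# Toy stand-ins for the refinement model and the response model.
def toy_refiner(prompt: str) -> str:
    return prompt.strip() + " Please answer clearly and safely."


def toy_responder(prompt: str) -> str:
    return f"[response to: {prompt}]"


pipeline = RefinedPipeline(toy_refiner, toy_responder)
```

Because the refiner sits in front of the response model as a separate component, the same refiner could in principle be paired with a different response model by swapping `responder`, which is the sense in which the abstract calls the framework transferable and pluggable.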