CL, AI
Apr 1, 2024

Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

arXiv:2404.01295v1
14 citations
h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses the problem of balancing user engagement against potential harm in LLMs, which matters to both users and developers, but it is incremental because it builds on existing controllability techniques.

The paper tackles the trade-off between safety and helpfulness in large language models (LLMs) by proposing methods to control both attributes, demonstrating that their approach can rewind a learned model and unlock its controllability.

As large language models (LLMs) become easily accessible, the trade-off between safety and helpfulness can significantly impact user experience. A model that prioritizes safety will cause users to feel less engaged and assisted, while one that prioritizes helpfulness can cause harm. Possible harms include teaching people how to build a bomb, exposing youth to inappropriate content, and hurting users' mental health. In this work, we propose to balance safety and helpfulness in diverse use cases by controlling both attributes in LLMs. We explore training-free and fine-tuning methods that do not require extra human annotations and analyze the challenges of controlling safety and helpfulness in LLMs. Our experiments demonstrate that our method can rewind a learned model and unlock its controllability.

