Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs
This work addresses the safety-utility trade-off in domain-expert LLMs and offers an incremental improvement toward developing safer specialized AI models.
The paper tackles the problem of domain-expert LLMs losing safety abilities by introducing MergeAlign, a merging-based alignment method that interpolates domain and alignment vectors, resulting in safer domain-specific models with minimal degradation on domain benchmarks.
There is growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often lose some of their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign to Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and the contributions of the individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
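To make the idea of interpolating domain and alignment vectors concrete, here is a minimal sketch. The abstract does not give the exact merging formula, so this assumes a common task-arithmetic reading: each "vector" is the difference between a fine-tuned model's weights and a shared base model's weights, and the merged model adds a convex combination of the two vectors back onto the base. The names `task_vector`, `merge_align`, and `alpha` are hypothetical, and the weights are toy lists of floats rather than real model tensors.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Return finetuned - base for every parameter (assumed vector definition)."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_align(base: Weights, domain: Weights, aligned: Weights,
                alpha: float = 0.5) -> Weights:
    """Add an interpolation of the domain and alignment vectors to the base.

    alpha is a hypothetical mixing weight: alpha=1.0 recovers the domain
    expert, alpha=0.0 recovers the aligned model.
    """
    domain_vec = task_vector(domain, base)
    align_vec = task_vector(aligned, base)
    merged: Weights = {}
    for name in base:
        merged[name] = [
            b + alpha * d + (1.0 - alpha) * a
            for b, d, a in zip(base[name], domain_vec[name], align_vec[name])
        ]
    return merged


# Tiny illustrative checkpoints (made-up numbers, not real model weights).
base = {"w": [0.0, 1.0]}
domain = {"w": [2.0, 1.0]}    # moved along the first coordinate
aligned = {"w": [0.0, 3.0]}   # moved along the second coordinate

print(merge_align(base, domain, aligned, alpha=0.5))  # {'w': [1.0, 2.0]}
```

In this toy case the merged weights sit halfway between the two fine-tuned models, which is the intuition behind trading off domain utility against safety with a single mixing weight. A real implementation would operate on tensors from model state dictionaries and might use a different weighting scheme than the one assumed here.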