A Lightweight Explainable Guardrail for Prompt Safety
For developers of LLM applications, LEG provides an efficient, interpretable guardrail to detect unsafe prompts without the computational cost of large models.
LEG matches or outperforms existing methods at prompt safety classification and explainability on three datasets, both in-domain and out-of-domain, with a model 10x smaller.
We propose a lightweight explainable guardrail (LEG) method to detect unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained on synthetic explanation data, generated using a novel strategy that counteracts the confirmation biases of LLMs. Finally, LEG's training process uses a novel loss that captures global explanation signals as weak supervision and combines cross-entropy and focal losses with uncertainty-based weighting. LEG matches or exceeds state-of-the-art performance for both prompt classification and explainability, in-domain and out-of-domain, on three datasets, even though its model is considerably smaller than those of current approaches.
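To make the loss description concrete, the following is a minimal sketch of how cross-entropy and focal losses might be combined with uncertainty-based weighting. It is not LEG's actual loss. It rests on several assumptions: cross-entropy is applied to the prompt classifier and focal loss to the word-level explanation classifier, the weighting takes the common learned log-variance form exp(-s_i) * L_i + s_i, and the global explanation signal used as weak supervision is omitted. All function names, probabilities, and parameter values are hypothetical.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true)^gamma, which
    down-weights examples the model already classifies confidently."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def uncertainty_weighted(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    Each s_i is a log-variance that would be a learnable parameter in a
    real training loop (assumed form, not taken from the paper).
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Hypothetical model outputs: probability assigned to the correct label.
prompt_loss = cross_entropy(0.9)              # prompt-level safe/unsafe decision
word_probs = [0.95, 0.6, 0.2]                 # per-word explanation labels
explanation_loss = sum(focal_loss(p) for p in word_probs) / len(word_probs)

# With both log-variances at 0, the two tasks are weighted equally.
total = uncertainty_weighted([prompt_loss, explanation_loss], log_vars=[0.0, 0.0])
print(f"prompt={prompt_loss:.4f} explanation={explanation_loss:.4f} total={total:.4f}")
```

In this form, raising a task's log-variance shrinks its contribution to the total while the additive s_i term penalizes inflating the variance without bound, so the balance between the two tasks can be learned rather than hand-tuned.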