CL | Dec 10, 2024

Granite Guardian

Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed

IBM

arXiv:2412.07724v212.919 citations | h-index: 43 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses the need for safe and responsible AI use by providing a generalizable risk detection model. It is an incremental advance: it builds on existing risk detection methods and extends their coverage to additional risk types.

The paper tackles the detection of multiple risks in LLM prompts and responses, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks for RAG. It introduces the Granite Guardian models, which achieve AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination benchmarks, respectively.

We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian
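To make the reported AUC figures concrete, here is a minimal Python sketch of how such a score can be computed from a detector's outputs, assuming AUC means area under the ROC curve. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; they are not taken from the paper or from the Granite Guardian repository.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison.

    Equals the probability that a randomly chosen risky example (label 1)
    receives a higher risk score than a randomly chosen safe example
    (label 0); ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration: 1 = risky, 0 = safe.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of risky from safe inputs, which is the scale on which the 0.871 and 0.854 figures should be read.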

