DC AI · Nov 18, 2024

Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations

Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, Zechun Liu, Changsheng Zhao

arXiv:2411.17713v1 · 12.625 citations · h-index: 26 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work provides an efficient safeguard for human-AI conversations on mobile devices, though the contribution is incremental because it builds on existing Llama Guard models.

The paper tackles the problem of deploying AI safety moderation on resource-constrained devices by introducing Llama Guard 3-1B-INT4, a compact model that achieves a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. It maintains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller (440MB).

This paper presents Llama Guard 3-1B-INT4, a compact and efficient Llama Guard model, which has been open-sourced to the community during Meta Connect 2024. We demonstrate that Llama Guard 3-1B-INT4 can be deployed on resource-constrained devices, achieving a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. Notably, our experiments show that Llama Guard 3-1B-INT4 attains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller in size (440MB).
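To put the quoted figures in perspective, the sketch below turns them into two back-of-the-envelope numbers: an upper bound on the time to produce a short moderation verdict, and the baseline model size implied by the "approximately 7 times smaller" claim. It rests on assumptions the abstract does not state: that the first output token arrives at the time-to-first-token, that the remaining tokens stream at the minimum throughput, and that the output lengths used are purely illustrative. The function names are hypothetical.

```python
# Bounds quoted in the abstract (not measurements of any particular run).
MIN_THROUGHPUT_TOK_PER_S = 30.0  # "at least 30 tokens per second"
MAX_TTFT_S = 2.5                 # "time-to-first-token of 2.5 seconds or less"
MODEL_SIZE_MB = 440.0            # size of Llama Guard 3-1B-INT4
SIZE_RATIO = 7.0                 # "approximately 7 times smaller"


def worst_case_latency_s(num_output_tokens: int) -> float:
    """Upper bound on response time under the stated assumptions.

    Assumes the first token arrives at MAX_TTFT_S and every later token
    is produced at MIN_THROUGHPUT_TOK_PER_S.
    """
    if num_output_tokens < 1:
        raise ValueError("num_output_tokens must be at least 1")
    return MAX_TTFT_S + (num_output_tokens - 1) / MIN_THROUGHPUT_TOK_PER_S


def implied_baseline_size_mb() -> float:
    """Approximate size of the larger model implied by the 7x ratio."""
    return MODEL_SIZE_MB * SIZE_RATIO


if __name__ == "__main__":
    # Output lengths here are illustrative, not taken from the paper.
    for n in (1, 31):
        print(f"{n} output tokens: at most {worst_case_latency_s(n):.2f} s")
    print(f"implied baseline size: about {implied_baseline_size_mb():.0f} MB")
```

Under these assumptions the time-to-first-token dominates the latency of a short verdict, so the prompt-processing cost matters more than decode speed for this use case.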
