DCAI · Nov 18, 2024

Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations

arXiv:2411.17713v1 · 23 citations · h-index: 26 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper provides an efficient safeguard for human-AI conversations on mobile devices, though the contribution is incremental because it builds on existing Llama Guard models.

The paper tackles the problem of deploying AI safety moderation on resource-constrained devices by introducing Llama Guard 3-1B-INT4, a compact model that reaches a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a mobile CPU. It maintains safety scores comparable or superior to those of its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller.

This paper presents Llama Guard 3-1B-INT4, a compact and efficient Llama Guard model, which has been open-sourced to the community during Meta Connect 2024. We demonstrate that Llama Guard 3-1B-INT4 can be deployed on resource-constrained devices, achieving a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. Notably, our experiments show that Llama Guard 3-1B-INT4 attains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller in size (440MB).
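To put the reported figures in perspective, the sketch below turns them into a rough worst-case response-time estimate and an implied size for the larger model. It is an illustration only, not something taken from the paper: the latency model (first token at the time-to-first-token, every later token at the minimum throughput), the function names, and the example output lengths are all assumptions.

```python
# Figures quoted in the abstract above.
MIN_THROUGHPUT_TOK_PER_S = 30.0   # "at least 30 tokens per second"
MAX_TTFT_S = 2.5                  # "2.5 seconds or less"
INT4_SIZE_MB = 440.0              # reported size of the INT4 model
SIZE_RATIO = 7.0                  # "approximately 7 times smaller"


def worst_case_latency_s(num_output_tokens: int) -> float:
    """Upper bound on response time under a simplified model.

    Assumption (not from the paper): the first token arrives at the
    time-to-first-token and each remaining token takes 1 / throughput s.
    """
    if num_output_tokens < 1:
        raise ValueError("num_output_tokens must be at least 1")
    return MAX_TTFT_S + (num_output_tokens - 1) / MIN_THROUGHPUT_TOK_PER_S


def implied_baseline_size_mb() -> float:
    """Rough size of the larger model implied by the ~7x ratio."""
    return INT4_SIZE_MB * SIZE_RATIO


if __name__ == "__main__":
    for n in (1, 10, 31):  # hypothetical output lengths
        print(f"{n:>3} tokens: <= {worst_case_latency_s(n):.2f} s")
    print(f"implied baseline size: ~{implied_baseline_size_mb():.0f} MB")
```

Because both reported figures are bounds (a minimum throughput and a maximum time-to-first-token), the result is an upper bound under this model rather than a measured latency; the implied baseline size is only as precise as the "approximately 7 times" ratio.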
