N-GLARE: A Non-Generative Latent Representation-Efficient LLM Safety Evaluator
N-GLARE gives LLM developers an efficient, real-time safety evaluation tool, though the contribution is incremental because it builds on existing red teaming methods.
The paper tackles the high cost and latency of evaluating LLM safety via red teaming by proposing N-GLARE, a method that analyzes latent representations instead of generated text and achieves results consistent with traditional red teaming at less than 1% of the token and runtime cost.
Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden-layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and by introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric is highly consistent with the safety rankings derived from red teaming. N-GLARE reproduces the discriminative trends of large-scale red teaming tests at less than 1% of the token and runtime cost, providing an efficient, output-free evaluation proxy for real-time diagnostics.
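The abstract does not give the formulas for APT or JSS, so the following is only a minimal sketch of the general idea behind a Jensen-Shannon based separability score, not the paper's method. It assumes that hidden states for benign and harmful prompts are already available as plain vectors, summarizes each set as a histogram of angles to a reference direction, and compares the two histograms with the standard Jensen-Shannon divergence. The function names `angle_histogram` and `js_divergence`, the reference direction, and the bin count are all hypothetical choices made for illustration.

```python
import math


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence (base 2) between two histograms.

    Inputs are non-negative counts or probabilities with a positive sum;
    they are normalized here. The result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def angle_histogram(vectors, reference, bins=4):
    """Histogram of angles in [0, pi] between each vector and `reference`.

    Assumption: this angular summary is an illustrative stand-in, not the
    paper's APT definition. All vectors must be non-zero.
    """
    counts = [0] * bins
    ref_norm = math.sqrt(sum(r * r for r in reference))
    for v in vectors:
        dot = sum(a * b for a, b in zip(v, reference))
        norm = math.sqrt(sum(a * a for a in v))
        cos = max(-1.0, min(1.0, dot / (norm * ref_norm)))
        idx = min(int(math.acos(cos) / math.pi * bins), bins - 1)
        counts[idx] += 1
    return counts


# Toy 2-D "hidden states" (made-up values, not real model activations).
benign = [[1.0, 0.1], [1.0, -0.1]]
harmful = [[-1.0, 0.1], [-1.0, -0.1]]
reference = [1.0, 0.0]

score = js_divergence(
    angle_histogram(benign, reference),
    angle_histogram(harmful, reference),
)
print(score)
```

In this toy setup the two sets of vectors fall into disjoint angle bins, so the divergence reaches its maximum; overlapping sets would score closer to 0. With a real model, the vectors would come from a forward pass that exposes hidden states, which requires no text generation.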