CL | AI | LG | Jun 13, 2025

Improving Large Language Model Safety with Contrastive Representation Learning

arXiv:2506.11938v1 | 4 citations | h-index: 40 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses safety concerns for users of LLMs by providing a more generalizable defense against adversarial attacks, though it is incremental as it builds on existing representation engineering methods.

The paper tackles the vulnerability of Large Language Models to adversarial attacks by proposing a contrastive representation learning defense framework, which improves robustness against various attack types without compromising standard performance.

Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense
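The triplet-based objective mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' implementation (see the linked repository for that): the Euclidean distance, the margin value, the function names, and the toy 2-D vectors standing in for hidden representations are all assumptions made for illustration. In particular, the paper's adversarial hard negative mining is approximated here by simply picking the closest candidate from a fixed pool of harmful representations.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the positive toward the anchor
    and push the negative at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def hardest_negative(anchor, candidates):
    """Hypothetical stand-in for hard negative mining: choose the candidate
    closest to the anchor, i.e. the one hardest to separate from it."""
    return min(candidates, key=lambda c: euclidean(anchor, c))

# Toy 2-D "representations" (assumed values, not real model activations).
benign_anchor = [0.0, 0.0]
benign_positive = [0.0, 1.0]
harmful_pool = [[3.0, 4.0], [1.0, 0.0]]

hard = hardest_negative(benign_anchor, harmful_pool)
loss_hard = triplet_loss(benign_anchor, benign_positive, hard)
loss_easy = triplet_loss(benign_anchor, benign_positive, harmful_pool[0])
print(hard, loss_hard, loss_easy)
```

In this toy setting the distant harmful vector already satisfies the margin and contributes zero loss, while the mined hard negative yields a positive loss, which is the general reason hard negatives provide a more useful training signal.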

Code Implementations: 1 repo