SEAICLApr 7, 2025

R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation

arXiv:2504.04699v29 citationsh-index: 20
Originality Incremental advance
AI Analysis

This addresses the need for more precise and explainable vulnerability detection in software security, though it is incremental as it builds on existing LLM and reasoning methods.

The paper tackled the problem of unreliable reasoning in large language models for software vulnerability detection by proposing R2Vul, which combines reinforcement learning from AI feedback and structured reasoning distillation to teach small code LLMs, resulting in a 1.5B model that outperforms a 32B teacher model and leading commercial LLMs like Claude-4-Opus.

Large language models (LLMs) have shown promising performance in software vulnerability detection, yet their reasoning capabilities remain unreliable. We propose R2Vul, a method that combines reinforcement learning from AI feedback (RLAIF) and structured reasoning distillation to teach small code LLMs to detect vulnerabilities while generating security-aware explanations. Unlike prior chain-of-thought and instruction tuning approaches, R2Vul rewards well-founded over deceptively plausible vulnerability explanations through RLAIF, which results in more precise detection and high-quality reasoning generation. To support RLAIF, we construct the first multilingual preference dataset for vulnerability detection, comprising 18,000 high-quality samples in C\#, JavaScript, Java, Python, and C. We evaluate R2Vul across five programming languages and against four static analysis tools, eight state-of-the-art LLM-based baselines, and various fine-tuning approaches. Our results demonstrate that a 1.5B R2Vul model exceeds the performance of its 32B teacher model and leading commercial LLMs such as Claude-4-Opus. Furthermore, we introduce a lightweight calibration step that reduces false positive rates under varying imbalanced data distributions. Finally, through qualitative analysis, we show that both LLM and human evaluators consistently rank R2Vul model's reasoning higher than other reasoning-based baselines.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes