CR · May 8

When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

arXiv:2605.07293
AI Analysis

For researchers evaluating LLMs in security operations centers (SOCs), this paper identifies a critical flaw in evaluation methodology that can make a working model appear non-functional.

The paper shows that strict regex-based parsing of LLM outputs can produce a 76-percentage-point gap in measured threat accuracy (0% vs. 76%) in security log classification; a fuzzy parser applied to the same outputs recovers the performance. The authors propose SOC-Bench v0 to prevent such evaluation errors.

LLM-based SOC log classifiers are commonly evaluated using regular-expression pipelines that extract structured fields from free-form model output. We demonstrate that this practice introduces a class of silent, systematic evaluation errors, which we term parsing-induced suppression, that can cause a fully functional model to appear completely non-functional. Using OpenSOC-AI, a LoRA fine-tuned TinyLlama-1.1B system for security log threat classification, as a reproducible case study, we show that a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy on the same model outputs and the same evaluation set: a gap of 76 percentage points attributable entirely to evaluation methodology. Severity accuracy remained constant at 58% under both parsers, providing a built-in control that isolates field-name format mismatch, rather than model degradation, as the causal mechanism. For external reference, Claude Sonnet evaluated zero-shot on the same 50-example set achieved 88% threat accuracy and 58% severity accuracy under the same fuzzy protocol. Residual errors under fuzzy evaluation concentrate in three categories (reconnaissance, brute force, and credential stuffing), each contributing 4 misclassifications, a pattern that reflects class-boundary difficulty among behaviorally adjacent log types rather than global model failure. We propose SOC-Bench v0, a benchmark framework comprising a standardized 13-category threat taxonomy, minimum statistical power requirements, a fuzzy field extraction specification, and a public scoring script intended to prevent parser-specific accuracy distortion in future SOC LLM research.
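
The mechanism described in the abstract can be illustrated with a small Python sketch. It is not the authors' scoring script: the sample outputs, gold labels, field names (`threat_type`, `Severity`), alias sets, and helper functions are all hypothetical, chosen only to show how a strict key match can score a correct model at zero while a control field parsed the same way under both parsers stays unchanged.

```python
import re

def normalize(s: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def strict_extract(output: str, key: str):
    """Strict parser: the key must appear verbatim at the start of a line."""
    m = re.search(rf"^{re.escape(key)}: (.+)$", output, re.MULTILINE)
    return m.group(1) if m else None

def fuzzy_extract(output: str, aliases: set):
    """Fuzzy parser: accept any line whose normalized key is a known alias."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and normalize(key) in aliases:
            return normalize(value)
    return None

def accuracy(preds, golds) -> float:
    """Fraction of predictions that match gold after normalization."""
    hits = sum(p is not None and normalize(p) == g for p, g in zip(preds, golds))
    return hits / len(golds)

# Hypothetical (model output, gold threat, gold severity) triples.
SAMPLES = [
    ("Threat Type: Brute Force\nSeverity: high", "brute force", "high"),
    ("**Threat:** Reconnaissance\nSeverity: low", "reconnaissance", "medium"),
    ("threat type: credential-stuffing\nSeverity: critical", "credential stuffing", "critical"),
    ("Threat Type: brute force\nSeverity: high", "credential stuffing", "high"),
]

outputs = [s[0] for s in SAMPLES]
gold_threat = [s[1] for s in SAMPLES]
gold_severity = [s[2] for s in SAMPLES]

results = {
    # The strict threat key never matches the model's actual formatting.
    "strict_threat": accuracy([strict_extract(o, "threat_type") for o in outputs], gold_threat),
    "fuzzy_threat": accuracy([fuzzy_extract(o, {"threat", "threat type"}) for o in outputs], gold_threat),
    # The severity key happens to match under both parsers (the control field).
    "strict_severity": accuracy([strict_extract(o, "Severity") for o in outputs], gold_severity),
    "fuzzy_severity": accuracy([fuzzy_extract(o, {"severity"}) for o in outputs], gold_severity),
}

for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

On these toy inputs the strict parser reports 0.00 threat accuracy and the fuzzy parser 0.75, while severity accuracy is 0.75 under both, mirroring in miniature the control argument in the abstract. The one remaining fuzzy-parser threat error is a genuine misclassification between two adjacent classes, not a parsing failure.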

