CRMay 8

When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

Chaitanya Vilas Garware, Sharif Noor Zisad

arXiv:2605.0729349.4

AI Analysis

For researchers evaluating LLMs in security operations centers (SOCs), this paper identifies a critical evaluation methodology flaw that can render models appear non-functional.

The paper shows that regex-based parsing of LLM outputs can cause a 76 percentage point accuracy drop (0% vs 76% threat accuracy) in security log classification, with a fuzzy parser recovering performance. The authors propose SOC-Bench v0 to prevent such evaluation errors.

LLM-based SOC log classifiers are commonly evaluated using regular-expression pipelines that extract structured fields from free-form model output. We demonstrate that this practice introduces a class of silent, systematic evaluation errors, which we term parsing-induced suppression that can cause a fully functional model to appear completely non-functional. Using OpenSOC-AI, a LoRA fine-tuned TinyLlama-1.1B system for security log threat classification, as a reproducible case study, we show that a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy on the same model outputs and the same evaluation set. A gap of 76 percentage points attributable entirely to evaluation methodology. Severity accuracy remained constant at 58% under both parsers, providing a built-in control that isolates field name format mismatch as the causal mechanism rather than model degradation. For external reference, Claude Sonnet evaluated zero-shot on the same 50 example set achieved 88% threat accuracy and 58% severity accuracy under the same fuzzy protocol. Residual errors under fuzzy evaluation concentrate in three categories including reconnaissance, brute force, and credential stuffing, each contributing all 4 misclassifications, a pattern that reflects class-boundary difficulty among behaviorally adjacent log types rather than global model failure. We propose SOC-Bench v0, a benchmark framework comprising a standardized 13 category threat taxonomy, minimum statistical power requirements, fuzzy field extraction specification, and a public scoring script intended to prevent parser specific accuracy distortion in future SOC LLM research.

View on arXiv PDF

Similar