SEMay 12

NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

arXiv:2605.1148217.1
AI Analysis

For software engineers relying on regression testing, NeuroFlake provides a more accurate and robust method to classify flaky tests, addressing the semantic fragility of LLMs.

NeuroFlake, a neuro-symbolic framework, improves flaky test classification F1-score to 69.34% from prior best 65.79% on imbalanced real-world data, and shows robustness to adversarial perturbations with only 4-7 pp drop vs. 8-18 pp for baselines.

Flaky tests, which exhibit non-deterministic pass/fail behavior for the same version of code, pose significant challenges to reliable regression testing. While large language models (LLMs) promise for automated flaky test classification, they often fail to comprehend the actual logic behind test flakiness, instead overfitting to superficial textual artifacts (e.g., specific variable names). This semantic fragility leads to poor generalization on real-world imbalance dataset and vulnerability to perturbations. In this paper, we introduce NeuroFlake, a novel neuro-Symbolic framework for classifying flaky tests on highly imbalanced, real-world datasets (FlakeBench). Unlike prior approaches that rely on brittle manual rule and black box learning, NeuroFlake integrates a Discriminative Token Mining (DTM) module to automate the discovery of high-fidelity, statistically significant source code tokens (e.g., specific concurrency primitives or async waits). By injecting these strong latent signals directly into LLM's attention mechanism, we bridge the gap between neural intuition and symbolic precision. Our experiments demonstrate that neuro-symbolic fusion significantly improves classification performance by leveraging classification F1-score to 69.34% while prior state-of-art shows best F1-score 65.79%. However, we rigorously evaluate NeuroFlake's robustness through adversarial stress testing, introducing semantic preserving augmentations (e.g., dead code injection, variable renaming). While baseline models exhibit performance degradation of 8-18 percentage points (pp) on perturbed tests, NeuroFlake maintains performance stability on unseen augmentations dropping only 4-7 pp.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes