SE · AI · May 21

Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

arXiv:2605.23108 · 24.3 · Has Code
Predicted impact top 78% in SE · last 90 days · Originality: Incremental advance
AI Analysis

For software engineering teams using AI-assisted code review, this work introduces a method for generating diverse, non-redundant feedback by imposing behavioral constraints on the reviewer. The evaluation is limited: inter-rater agreement was not assessed, and cross-model validation covers only 3 pull requests.

The authors present a system that constrains AI code review behavior through philosophical dispositions (e.g., Pyrrhonist Skepticism, Confucian relational ethics). Across 601 findings on 50 pull requests, it reaches 46% convergence with human reviewers and a 75% unique-finding rate, with no findings judged false-positive by the author. In a controlled baseline comparison, 51% of the disposition findings are not produced by the same model under generic "expert reviewer" prompting.

AI-assisted code review tools typically operate as generic "expert reviewer" agents, producing homogeneous findings regardless of the analysis type needed. We present a system that constrains AI reviewer behavior through philosophical dispositions: coherent personality lenses grounded in specific epistemological traditions (Pyrrhonist Skepticism, Navya-Nyāya logic, Diogenes' Cynicism, Confucian relational ethics) that direct attention to structurally different types of issues. Each disposition is defined apophatically (by what it refuses to do), equipped with a self-monitoring failure mode (hamartia), and orchestrated in sequence by role protocols. We evaluate this system on 50 merged pull requests across 7 repositories spanning 5 programming languages (Python, Go, C++, Java, Terraform), 5 organizations (2 enterprise, 3 open-source), and 2 temporal eras (pre-AI 2020, post-AI 2024-2026). The disposition system achieves 46% convergence with human reviewers (validating signal quality), identifies unique findings at a 75% rate, and produces no findings judged false-positive by the author across 601 total findings (inter-rater agreement was not assessed and remains a limitation). A controlled baseline comparison demonstrates that 51% of disposition findings are not produced by the same model using generic "expert reviewer" prompting, and these unique findings target structural, operational, and logical concerns rather than standard code-level issues. Preliminary cross-model validation (Claude Opus vs. GPT Codex 5.3-xhigh) on 3 PRs shows 100% framework-structure adherence with 39% finding-level agreement, suggesting the framework provides real behavioral constraint while preserving model-specific analytical perspective.
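
The abstract's description of a disposition (an apophatic definition, a hamartia, and sequential orchestration) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class and function names, the prompt wording, the example refusals, the deduplication step, and the `review_fn` callable (a stand-in for whatever model call is used) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Disposition:
    """Hypothetical container for one reviewer disposition."""
    name: str
    refusals: Tuple[str, ...]  # apophatic definition: what the reviewer refuses to do
    hamartia: str              # failure mode the reviewer is asked to self-monitor

    def to_prompt(self) -> str:
        # Assumed prompt wording; the paper's actual prompts are not shown in the abstract.
        lines = [f"You review code through the lens of {self.name}."]
        lines += [f"You refuse to {r}." for r in self.refusals]
        lines.append(f"Watch for your own failure mode: {self.hamartia}.")
        return "\n".join(lines)


@dataclass(frozen=True)
class Finding:
    disposition: str
    text: str


def run_protocol(
    dispositions: Sequence[Disposition],
    diff: str,
    review_fn: Callable[[str, str], List[str]],
) -> List[Finding]:
    """Apply each disposition in sequence and collect its findings.

    Dropping repeated findings is an assumption of this sketch,
    not a step described in the abstract.
    """
    findings: List[Finding] = []
    seen = set()
    for d in dispositions:
        for text in review_fn(d.to_prompt(), diff):
            key = text.strip().lower()
            if key not in seen:
                seen.add(key)
                findings.append(Finding(d.name, text))
    return findings


def stub_review(prompt: str, diff: str) -> List[str]:
    """Stand-in for a model call so the sketch runs without any API."""
    if "Pyrrhonist" in prompt:
        return ["Claim in comment is unverified", "Missing rollback plan"]
    return ["Missing rollback plan"]


skeptic = Disposition(
    name="Pyrrhonist Skepticism",
    refusals=("accept an unverified claim",),      # hypothetical refusal
    hamartia="doubting everything equally",        # hypothetical hamartia
)
cynic = Disposition(
    name="Diogenes' Cynicism",
    refusals=("praise complexity for its own sake",),  # hypothetical refusal
    hamartia="contrarianism without evidence",          # hypothetical hamartia
)
results = run_protocol([skeptic, cynic], "example diff", stub_review)
```

In this sketch the order of `dispositions` plays the role of the role protocol: earlier lenses claim a finding first, and a later lens contributes only what the earlier ones did not raise.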

