CAM: A Causality-based Analysis Framework for Multi-Agent Code Generation Systems
This work addresses the challenge of optimizing complex multi-agent code generation systems for developers and researchers, though the contribution appears incremental because it applies causality analysis to a specific domain.
The paper tackles the problem of understanding the importance of intermediate outputs in multi-agent code generation systems (MACGS) by proposing CAM, a causality-based analysis framework that quantifies feature contributions to system correctness. The results include identifying context-dependent features, achieving up to a 7.2% Pass@1 improvement with hybrid architectures, and enabling two applications: failure repair with a 73.3% success rate and feature pruning that reduces intermediate token consumption by up to 66.8%.
Despite the remarkable success that Multi-Agent Code Generation Systems (MACGS) have achieved, the inherent complexity of multi-agent architectures produces substantial volumes of intermediate outputs. To date, the individual importance of these intermediate outputs to system correctness remains opaque, which impedes targeted optimization of MACGS designs. To address this challenge, we propose CAM, the first \textbf{C}ausality-based \textbf{A}nalysis framework for \textbf{M}ACGS that systematically quantifies the contribution of different intermediate features to system correctness. By comprehensively categorizing intermediate outputs and simulating realistic errors on intermediate features, we identify the features that matter for system correctness and aggregate their importance rankings. We then conduct an extensive empirical analysis of these rankings, which yields two main findings. First, we uncover context-dependent features\textemdash features whose importance emerges mainly through interactions with other features, indicating that quality assurance for MACGS should incorporate cross-feature consistency checks. Second, we show that hybrid-backend MACGS, in which different backend LLMs are assigned according to their relative strengths, achieve up to a 7.2\% Pass@1 improvement, underscoring hybrid architectures as a promising direction for future MACGS design. We further demonstrate CAM's practical utility through two applications: (1) failure repair, which achieves a 73.3\% success rate by optimizing the top-3 importance-ranked features, and (2) feature pruning, which reduces intermediate token consumption by up to 66.8\% while maintaining generation performance. Our work provides actionable insights for MACGS design and deployment, establishing causality analysis as a powerful approach for understanding and improving MACGS.
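The intervention idea described in the abstract (corrupt an intermediate feature, observe the change in correctness, rank features by that change) can be illustrated with a minimal Python sketch. This is not CAM's implementation: the toy system, the feature names (`plan`, `tests`, `review`), the task set, and the pass rule are all hypothetical and chosen only to make the arithmetic easy to follow. The `interaction` function is likewise an assumed, simplified way to expose a feature whose importance appears only in combination with another.

```python
from itertools import combinations

# Hypothetical intermediate features of a toy pipeline (not taken from the paper).
FEATURES = ["plan", "tests", "review"]

# Hypothetical tasks; half of them depend on a correct plan.
TASKS = [
    {"needs_plan": True},
    {"needs_plan": True},
    {"needs_plan": False},
    {"needs_plan": False},
]


def toy_system_passes(task, corrupted):
    """Stand-in for running the system on one task with some features corrupted."""
    if "plan" in corrupted and task["needs_plan"]:
        return False
    # "tests" and "review" back each other up: only losing both causes a failure.
    if "tests" in corrupted and "review" in corrupted:
        return False
    return True


def pass_rate(corrupted):
    """Fraction of tasks that pass when the given features are corrupted."""
    corrupted = frozenset(corrupted)
    return sum(toy_system_passes(t, corrupted) for t in TASKS) / len(TASKS)


def effect(corrupted):
    """Drop in pass rate caused by corrupting the given features."""
    return pass_rate(()) - pass_rate(corrupted)


def rank_features():
    """Rank features by the pass-rate drop from corrupting each one alone."""
    scores = [(f, effect([f])) for f in FEATURES]
    return sorted(scores, key=lambda pair: -pair[1])


def interaction(a, b):
    """Joint effect beyond the sum of the two individual effects."""
    return effect([a, b]) - effect([a]) - effect([b])


if __name__ == "__main__":
    print(rank_features())
    for a, b in combinations(FEATURES, 2):
        print(a, b, interaction(a, b))
```

In this toy setting, `tests` and `review` each have an individual effect of zero, yet corrupting both together removes all passing tasks. That pattern is one simple reading of a context-dependent feature, and it suggests why a ranking based only on single-feature interventions could understate such features.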