Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness

arXiv:2603.27539 · 25.81 citations

Predicted impact: top 16% in MA · last 90 days · Originality: Incremental advance

AI Analysis

For researchers and practitioners in AI-driven finance, this work provides a structured taxonomy and highlights critical evaluation pitfalls, but it is a survey with hypotheses rather than empirical validation.

This survey identifies the lack of a shared evaluation framework for LLM-based financial multi-agent systems. It proposes a taxonomy and the Coordination Primacy Hypothesis (CPH), which holds that coordination design may matter more than model scaling. It also documents five evaluation failures that can reverse the sign of reported returns and introduces the Coordination Breakeven Spread (CBS) metric.

Multi-agent systems based on large language models (LLMs) for financial trading have grown rapidly since 2023, yet the field lacks a shared framework for understanding what drives performance or for evaluating claims credibly. This survey makes three contributions. First, we introduce a four-dimensional taxonomy covering architecture pattern, coordination mechanism, memory architecture, and tool integration, and apply it to 12 multi-agent systems and two single-agent baselines. Second, we formulate the Coordination Primacy Hypothesis (CPH): inter-agent coordination protocol design is a primary driver of trading decision quality, often exerting greater influence than model scaling. CPH is presented as a falsifiable research hypothesis supported by tiered structural evidence rather than as an empirically validated conclusion; its definitive validation requires evaluation infrastructure that does not yet exist in the field. Third, we document five pervasive evaluation failures (look-ahead bias, survivorship bias, backtesting overfitting, transaction cost neglect, and regime-shift blindness) and show that these can reverse the sign of reported returns. Building on the CPH and the evaluation critique, we introduce the Coordination Breakeven Spread (CBS), a metric for determining whether multi-agent coordination adds genuine value net of transaction costs, and propose minimum evaluation standards as prerequisites for validating the CPH.
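The abstract does not state how CBS is computed, so the following Python sketch is only an illustration of the general idea of a cost breakeven, not the paper's definition. It assumes a simple linear cost model (net return = gross return minus turnover times a per-unit cost); the function names, the cost model, and all numbers are hypothetical. It shows two things mentioned above: that ignoring transaction costs can flip the sign of a reported return, and that there is a cost level beyond which a higher-turnover multi-agent system stops beating a single-agent baseline.

```python
def net_return(gross_return: float, turnover: float, cost_per_unit: float) -> float:
    """Net return under an ASSUMED linear cost model (not taken from the paper).

    gross_return  -- return before costs, as a fraction (0.08 means 8%)
    turnover      -- traded notional as a multiple of capital over the period
    cost_per_unit -- cost per unit of turnover, as a fraction (0.002 means 20 bps)
    """
    return gross_return - turnover * cost_per_unit


def breakeven_cost(gross_multi: float, turnover_multi: float,
                   gross_single: float, turnover_single: float) -> float:
    """Hypothetical breakeven: the per-unit cost at which the multi-agent
    system's net return equals the single-agent baseline's net return.

    This sketch only covers the case where the multi-agent system trades
    more than the baseline; other cases are rejected rather than guessed at.
    """
    extra_turnover = turnover_multi - turnover_single
    if extra_turnover <= 0:
        raise ValueError("sketch assumes the multi-agent system has higher turnover")
    extra_gross = gross_multi - gross_single
    # If coordination adds no gross return, any positive cost already erases it.
    return max(extra_gross / extra_turnover, 0.0)


# Made-up numbers: an 8% gross return turns negative once costs are charged.
reported = net_return(gross_return=0.08, turnover=50.0, cost_per_unit=0.0)
realistic = net_return(gross_return=0.08, turnover=50.0, cost_per_unit=0.002)
print(reported > 0, realistic < 0)

# Made-up numbers: cost level at which the coordination advantage disappears.
c_star = breakeven_cost(gross_multi=0.12, turnover_multi=40.0,
                        gross_single=0.08, turnover_single=20.0)
print(round(c_star * 1e4, 2), "bps per unit of turnover")
```

Under these assumptions, the multi-agent system adds value only when realistic trading costs are below the breakeven level; with the example inputs that level is approximately 20 bps per unit of turnover.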
