Benchmarking LLM Agents for Wealth-Management Workflows
This work addresses human error and delay in routine wealth-management workflows for financial professionals, though it appears incremental because it extends an existing agent framework.
The study investigated whether general-purpose LLM agents can complete wealth-management tasks accurately and economically, using a benchmark of 12 task-pairs with explicit acceptance criteria and deterministic graders. The results showed that agents are limited more by end-to-end workflow reliability than by mathematical reasoning, that autonomy level meaningfully affects their performance, and that incorrect model evaluation has hindered benchmarking.
Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically. The study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline. It aims to create and assess an evaluation set that can meaningfully measure an agent's fitness for assistant-level wealth-management work. We construct a benchmark of 12 task-pairs for wealth-management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seed the environment with new finance-specific data and introduce high- and low-autonomy variants of every task. We conclude that agents are limited less by mathematical reasoning and more by end-to-end workflow reliability, that they are meaningfully affected by autonomy level, and that incorrect evaluation of models has hindered benchmarking.
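To make the idea of "explicit acceptance criteria and deterministic graders" concrete, the following is a minimal sketch of how one task-pair might be represented and graded. It is not the benchmark's actual code: the class names, the `grade` and `pass_rate` helpers, the task ID, the prompts, the `holdings.csv` file name, and the 60% target weight are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shape of an agent's structured output (an assumption).
Output = Dict[str, object]


@dataclass
class Criterion:
    """One acceptance criterion: a named, pure check on the agent's output."""
    name: str
    check: Callable[[Output], bool]


@dataclass
class TaskPair:
    """The same task stated at two autonomy levels, sharing one set of criteria."""
    task_id: str
    high_autonomy_prompt: str  # goal only; the agent plans its own steps
    low_autonomy_prompt: str   # explicit step-by-step instructions
    criteria: List[Criterion]


def grade(pair: TaskPair, output: Output) -> Dict[str, bool]:
    """Deterministic grading: same output in, same verdicts out (no LLM judge)."""
    return {c.name: bool(c.check(output)) for c in pair.criteria}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of acceptance criteria that were met."""
    return sum(results.values()) / len(results) if results else 0.0


# Illustrative task-pair; all values below are invented for the example.
pair = TaskPair(
    task_id="rebalance-001",
    high_autonomy_prompt="Rebalance the client's portfolio to its target allocation.",
    low_autonomy_prompt="1) Open holdings.csv. 2) Compute weights. 3) Email the summary.",
    criteria=[
        Criterion(
            "equity_weight_on_target",
            # Passes when the reported weight is within 1 point of a 60% target.
            lambda o: abs(float(o.get("equity_weight", -1.0)) - 0.60) <= 0.01,
        ),
        Criterion("summary_sent", lambda o: o.get("summary_sent") is True),
    ],
)

# An agent that got the arithmetic right but never sent the summary.
agent_output = {"equity_weight": 0.605, "summary_sent": False}
results = grade(pair, agent_output)
print(results, pass_rate(results))
```

In this sketch the agent computes the correct weight but fails the communication step, so only one of the two criteria passes. That is the kind of end-to-end workflow failure, as opposed to a mathematical error, that the abstract identifies as the main limitation.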