CL · Apr 8

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

arXiv:2604.07054 · 18.21 citations
AI Analysis

This paper addresses the problem of measuring deal progression and outcomes in sales dialogues, a concern for AI researchers and developers. The contribution is incremental, since it builds on existing benchmarking methods.

The authors address the lack of benchmarks for evaluating LLMs in realistic sales dialogues by introducing SalesLLM, a bilingual benchmark with 1,805 multi-turn scenarios. They find that top-performing LLMs are competitive with human-level performance, while less capable models fall short of it.

Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performing LLMs are competitive with human-level performance, while less capable ones fall below it. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.
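
The abstract validates the benchmark by reporting a Pearson correlation between SalesLLM scores and expert human ratings. As a rough illustration of that statistic only, the sketch below computes Pearson r for two paired lists. The function name `pearson_r` and both data lists are hypothetical placeholders, not the paper's code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only; not taken from the paper.
benchmark_scores = [0.42, 0.55, 0.61, 0.70, 0.83]  # automatic scores per model
human_ratings = [2.1, 2.9, 3.0, 3.6, 4.4]          # expert ratings per model

r = pearson_r(benchmark_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the automatic scores rank and space the models almost exactly as the human experts do, which is the sense in which the abstract's r=0.98 supports the automatic pipeline.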