CL · Apr 8

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, Leo Huang

arXiv:2604.07054 · 18.21 citations

AI Analysis

This paper addresses a gap relevant to AI researchers and developers: existing dialogue benchmarks rarely measure deal progression and outcomes in sales dialogues. The contribution is incremental, since it builds on existing benchmarking methods.

The authors tackle the lack of benchmarks for evaluating LLMs in realistic sales dialogues by introducing SalesLLM, a bilingual benchmark with 1,805 multi-turn scenarios. They find that the top-performing LLMs are competitive with human-level performance, while less capable models fall short of it.

Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: the top-performing LLMs are competitive with human-level performance, while the less capable ones fall below it. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.
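The agreement figure quoted above (Pearson r=0.98) is a standard correlation coefficient between automatic scores and expert ratings. The sketch below shows how such a coefficient can be computed. It is not the authors' evaluation code: the function name `pearson_r` and the sample scores are hypothetical, chosen only for illustration, and do not reproduce any numbers from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one automatic benchmark score and one mean expert
# rating per evaluated model. These values are made up for illustration.
benchmark_scores = [0.42, 0.55, 0.61, 0.70, 0.83]
human_ratings = [2.1, 2.9, 3.0, 3.6, 4.4]

r = pearson_r(benchmark_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the automatic score ranks and spaces the models much as the human experts do; it does not by itself show that the two scales agree in absolute terms.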
