CL, Jun 30, 2025

Real-World En Call Center Transcripts Dataset with PII Redaction

arXiv:2507.02958v1, 2 citations, h-index: 1
Originality: Synthesis-oriented
AI Analysis

This dataset fills a critical gap for researchers and developers working on AI systems for customer support and sales by providing a real-world, privacy-compliant resource.

The authors address the scarcity of publicly available real-world call center datasets by introducing CallCenterEN, a large-scale English call center transcript dataset of 91,706 conversations (10,448 audio hours). It is the largest open-source release of its kind, and its transcriptions are PII-redacted to support data privacy compliance.

We introduce CallCenterEN, a large-scale (91,706 conversations, corresponding to 10,448 audio hours), real-world English call center transcript dataset designed to support research and development in customer support and sales AI systems. This is the largest release to date of open-source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines, and the United States. The transcriptions are high-quality, human-readable, and PII-redacted: all personally identifiable information (PII) has been rigorously removed to ensure compliance with global data protection laws. The audio is not included in the public release due to biometric privacy concerns. Given the scarcity of publicly available real-world call center datasets, CallCenterEN fills a critical gap in the landscape of available ASR corpora, and is released under a CC BY-NC 4.0 license for non-commercial research use.
