AS CL | Jul 15, 2025

Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

Rumi Allbert, Nima Yazdani, Ali Ansari, Aruj Mahajan, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi

Stanford

arXiv:2507.16835v218.335 citations | h-index: 2

Originality: Synthesis-oriented

AI Analysis

This paper provides practical guidance for selecting components in multimodal conversational AI systems, particularly for job interviews. Its contribution is incremental, since it compares existing methods on new data.

The paper evaluates cascaded speech-to-text, LLM, and text-to-speech combinations for AI interview systems through a large-scale empirical comparison, using data sampled from over 300,000 interviews. It finds that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperformed the alternatives in both quality metrics and user satisfaction scores. It also notes that objective metrics correlated only weakly with user satisfaction.

Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We used an LLM-as-a-Judge automated evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversations and contribute a validated evaluation methodology for human-AI interactions.
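The cascaded architecture described in the abstract can be pictured as three swappable stages chained in sequence. The following is a minimal sketch under stated assumptions, not the authors' implementation: `VoiceStack`, the three callable type aliases, and the stub components are all hypothetical names introduced for illustration, and real STT, LLM, and TTS services expose their own APIs.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical component signatures (assumptions, not any vendor's API).
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass(frozen=True)
class VoiceStack:
    """One STT x LLM x TTS combination, treated as a swappable unit."""
    name: str
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # candidate speech -> text
        reply = self.llm(transcript)     # text -> interviewer reply
        return self.tts(reply)           # reply text -> synthesized speech


# Stub components so the sketch runs without any external service.
def stub_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_llm(text: str) -> str:
    return f"Thanks. You said: {text}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


stack = VoiceStack("stub-stack", stub_stt, stub_llm, stub_tts)
audio_out = stack.respond(b"I have five years of Python experience.")
print(audio_out.decode("utf-8"))
```

Because each stage sits behind a plain callable, comparing stacks in this sketch amounts to constructing several `VoiceStack` values with different components and running the same inputs through each.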
