CL | AI | Oct 5, 2025

Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

arXiv:2510.04016v1 | h-index: 2 | ICSEC
Originality: Synthesis-oriented
AI Analysis

This work addresses low-latency end-of-turn detection for Thai voice agents. It establishes a baseline but is incremental, since it adapts existing methods to a new language.

The paper tackles the problem of detecting when a user has finished speaking in real-time voice agents for the Thai language. It compares zero-shot and few-shot prompting of compact LLMs against supervised fine-tuning of lightweight transformers, and finds that small, fine-tuned models can make near-instant decisions, with a clear accuracy-latency tradeoff.

Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.
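
To make the "binary decision over token boundaries" framing concrete, here is a minimal sketch of how one transcribed utterance could be turned into labeled examples. It is an illustration under stated assumptions, not the paper's pipeline: the input is assumed to be already word-segmented, the names `FINAL_PARTICLES`, `BoundaryExample`, and `boundary_examples` are hypothetical, and the particle set is a small illustrative subset rather than the cue list the authors used.

```python
from dataclasses import dataclass
from typing import List

# Illustrative subset of Thai sentence-final particles.
# Assumption: the paper's actual cue inventory is not given in the abstract.
FINAL_PARTICLES = {"ครับ", "ค่ะ", "คะ", "นะ", "จ้ะ"}


@dataclass
class BoundaryExample:
    prefix: List[str]          # tokens seen so far, up to this boundary
    label: int                 # 1 = end of turn, 0 = speaker continues
    ends_with_particle: bool   # simple Thai-specific cue for this boundary


def boundary_examples(tokens: List[str]) -> List[BoundaryExample]:
    """Create one binary example per token boundary of a complete utterance.

    Only the boundary after the last token is labeled as end of turn;
    every earlier boundary is treated as an incomplete turn.
    """
    examples = []
    for i in range(1, len(tokens) + 1):
        prefix = tokens[:i]
        examples.append(
            BoundaryExample(
                prefix=prefix,
                label=int(i == len(tokens)),
                ends_with_particle=prefix[-1] in FINAL_PARTICLES,
            )
        )
    return examples


# Assumed pre-segmented utterance: "Are you free tomorrow?" plus a polite particle.
utterance = ["พรุ่งนี้", "ว่าง", "ไหม", "ครับ"]
examples = boundary_examples(utterance)
labels = [ex.label for ex in examples]  # [0, 0, 0, 1]
```

A classifier, whether a prompted LLM or a fine-tuned transformer, would then score each prefix as it arrives from the speech recognizer. The particle flag is only one possible hand-crafted cue and is not sufficient on its own, since such particles can also occur mid-turn.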

