CL · AI · Apr 15

Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

arXiv:2604.13620 · 6.5 · h-index: 3
Predicted impact top 90% in CL · last 90 days · Originality Synthesis-oriented
AI Analysis

Addresses the lack of turn-taking datasets for Turkish, enabling more natural voice-based chatbots for Turkish speakers.

Syn-TurnTurk, a synthetic Turkish dialogue dataset for turn-taking prediction, was generated using Qwen LLMs. BI-LSTM and Ensemble (LR+RF) models achieved 0.839 accuracy and 0.910 AUC, showing synthetic data can improve natural interaction.

Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems rely on simple silence detection, which often fails because human speech involves irregular pauses. As a result, bots interrupt users and break the conversational flow. The problem is even more severe for languages like Turkish, which lack high-quality datasets for turn-taking prediction. This paper introduces Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using various Qwen Large Language Models (LLMs) to mirror real-life verbal exchanges, including overlaps and strategic silences. We evaluated the dataset using several traditional and deep learning architectures. The results show that advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieve high accuracy (0.839) and AUC scores (0.910). These findings demonstrate that our synthetic dataset can help models understand linguistic cues, allowing for more natural human-machine interaction in Turkish.
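The failure mode of silence-based end-of-turn detection described in the abstract can be illustrated with a minimal sketch. This is not the paper's method or data: the function name `first_end_of_turn`, the pause durations, and the thresholds are all hypothetical values chosen for illustration.

```python
from typing import List, Optional


def first_end_of_turn(pauses_ms: List[int], threshold_ms: int) -> Optional[int]:
    """Return the index of the first pause at or above the threshold, or None.

    This mimics a naive detector that declares the user's turn finished
    as soon as it hears a sufficiently long silence.
    """
    for i, pause in enumerate(pauses_ms):
        if pause >= threshold_ms:
            return i
    return None


# Hypothetical pause durations (ms) after each word of one user utterance.
# The speaker hesitates mid-sentence (index 2) and only finishes at index 5.
pauses_ms = [120, 90, 850, 110, 140, 1200]
true_end = len(pauses_ms) - 1

for threshold_ms in (700, 1000):
    predicted = first_end_of_turn(pauses_ms, threshold_ms)
    interrupted = predicted is not None and predicted < true_end
    print(f"threshold={threshold_ms} ms -> end predicted at word {predicted}, "
          f"interrupted={interrupted}")
```

With the lower threshold, the mid-sentence hesitation is mistaken for the end of the turn. Raising the threshold avoids the interruption in this toy case, but a longer threshold also delays every response, which is the trade-off that motivates learned turn-taking predictors.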

