CL · AI · IR · Feb 3

RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish

arXiv:2602.03652v1
h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the English-centric state of RAG design guidance, which leaves Turkish NLP practitioners without language-specific advice. The advance is incremental because it adapts existing methods to a new language.

The authors address the lack of retrieval-augmented generation (RAG) guidance for morphologically rich languages by building a Turkish RAG dataset and benchmarking seven pipeline stages. Complex methods such as HyDE reach the highest accuracy (85%, against a 78.70% baseline), while a Pareto-optimal configuration combining cross-encoder reranking and context augmentation reaches a comparable 84.60% at much lower cost.

Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%), considerably above the baseline (78.70%). A Pareto-optimal configuration using cross-encoder reranking and context augmentation achieves comparable performance (84.60%) at much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.
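
To make the notion of a Pareto-optimal configuration concrete, the following is a minimal sketch of selecting non-dominated configurations by accuracy and cost. It is not the authors' code. The three accuracy figures 78.70, 84.60, and 85 come from the abstract; every cost value, the fourth configuration, and the names `Config` and `pareto_front` are hypothetical placeholders chosen only for illustration.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    cost: float      # relative cost units, lower is better (hypothetical values)


def pareto_front(configs: list[Config]) -> list[Config]:
    """Return the configs that no other config dominates.

    A config is dominated when another one is at least as accurate and
    at least as cheap, and strictly better on one of the two axes.
    """
    front = []
    for c in configs:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.cost <= c.cost
            and (o.accuracy > c.accuracy or o.cost < c.cost)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


configs = [
    # Accuracies below are from the abstract; costs are made-up placeholders.
    Config("baseline", 78.70, 1.0),
    Config("cross-encoder rerank + context augmentation", 84.60, 2.0),
    Config("HyDE", 85.00, 5.0),
    # Entirely hypothetical entry, added so the example has a dominated point.
    Config("over-stacked generative modules", 80.00, 8.0),
]

for c in pareto_front(configs):
    print(f"{c.name}: {c.accuracy:.2f}% at cost {c.cost}")
```

Under these placeholder costs, the hypothetical over-stacked entry is dominated (less accurate and more expensive than the reranking configuration), while the other three remain on the front. Which point to deploy then depends on how much extra cost a 0.40-point accuracy gain is worth.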
