CL, IR · Nov 11, 2025

TurkEmbed: Turkish Embedding Model on NLI & STS Tasks

Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç

arXiv:2511.08376v1 · 2 citations · h-index: 4 · 2025 Innovations in Intelligent Systems and Applications Conference (ASYU)

Originality: Incremental advance

AI Analysis

TurkEmbed provides a more accurate embedding model for Turkish NLP applications. The advance is incremental: it builds on existing methods, adding dataset and training enhancements specific to Turkish.

The paper tackles the problem of limited accuracy in Turkish embedding models by introducing TurkEmbed, which achieves a 1-4% improvement over the state-of-the-art on All-NLI-TR and STS-b-TR benchmarks.

This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.
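The abstract names two mechanisms that may be easier to follow in code: matryoshka-style embeddings, where a prefix of the full vector can be used as a smaller embedding, and STS scoring with Pearson and Spearman correlation. The following is a minimal sketch in plain Python, not the authors' evaluation pipeline. All function names are hypothetical, and the vectors and gold scores in the demo are invented toy values rather than model outputs or benchmark data.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length.

    This is the usual way a matryoshka-trained embedding is shortened
    for a resource-constrained setting (assumption: prefix truncation).
    """
    prefix = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Toy sentence-pair embeddings and gold similarity scores (invented).
    pairs = [
        ([0.9, 0.1, 0.3, 0.2], [0.8, 0.2, 0.1, 0.4]),
        ([0.1, 0.9, 0.2, 0.7], [0.7, 0.1, 0.6, 0.1]),
        ([0.5, 0.5, 0.9, 0.1], [0.4, 0.6, 0.2, 0.8]),
    ]
    gold = [4.5, 1.0, 3.0]
    for dim in (2, 4):
        predicted = [
            cosine(truncate_and_normalize(a, dim), truncate_and_normalize(b, dim))
            for a, b in pairs
        ]
        print(dim, pearson(predicted, gold), spearman(predicted, gold))
```

Running the demo at several values of `dim` shows the trade-off the abstract alludes to: shorter prefixes are cheaper to store and compare, and the two correlation scores indicate how much similarity quality is retained at each size.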
