IR | AI | LG | Mar 10, 2025

Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark

arXiv:2503.07470v1 | 2 citations | h-index: 13 | PACLIC
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap for Vietnamese NLP researchers by providing tools to advance information retrieval, but it is incremental as it adapts existing methods to a new language context.

The paper tackles the lack of Vietnamese benchmarks for information retrieval by introducing a new benchmark and a novel objective function based on InfoNCE loss, resulting in improved performance for Vietnamese embedding models, though specific numerical gains are not detailed.

With the rapid development of natural language processing, many language models have been developed for a wide range of tasks. One important task is information retrieval (IR), which requires models to retrieve the documents relevant to a query. Despite its importance in many real-life applications, especially retrieval-augmented generation (RAG) systems, this task lacks Vietnamese benchmarks. This gap makes it difficult to assess and compare existing Vietnamese embedding models on the task and slows the advancement of Vietnamese natural language processing (NLP) research. In this work, we aim to provide the Vietnamese research community with a new benchmark for information retrieval, focused mainly on retrieval and reranking tasks. Furthermore, we present a new objective function based on the InfoNCE loss, which we use to train our Vietnamese embedding model. Our function is designed to outperform the original InfoNCE loss on information retrieval tasks. Finally, we analyze how temperature, a hyperparameter in both objective functions, affects the performance of text embedding models.
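
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the standard InfoNCE loss with in-batch negatives and a temperature parameter. It is not the paper's new objective, which this summary does not specify. The function name `info_nce_loss`, the default temperature of 0.05, and the use of cosine similarity are illustrative assumptions.

```python
import numpy as np


def info_nce_loss(queries, documents, temperature=0.05):
    """Standard InfoNCE with in-batch negatives (illustrative sketch only).

    Assumes row i of `documents` is the positive for row i of `queries`;
    every other row in the batch acts as a negative. The default
    temperature is an arbitrary choice, not a value from the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)

    # Similarity matrix scaled by temperature: smaller values sharpen
    # the softmax distribution over candidate documents.
    logits = (q @ d.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: four orthogonal query/document pairs, perfectly aligned.
embeddings = np.eye(4)
sharp = info_nce_loss(embeddings, embeddings, temperature=0.05)
soft = info_nce_loss(embeddings, embeddings, temperature=1.0)
print(sharp < soft)  # a lower temperature gives a lower loss on aligned pairs
```

On this toy batch the loss at temperature 1.0 equals log(1 + 3/e), since each positive has similarity 1 and each of the three negatives has similarity 0. This is the kind of temperature sensitivity the abstract says the authors analyze.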

