Improving Vietnamese Legal Document Retrieval using Synthetic Data
This work addresses the challenge of limited labeled data for Vietnamese legal information retrieval, though the contribution is incremental: it applies existing techniques to a specific context.
The paper tackles the scarcity of annotated datasets for Vietnamese legal document retrieval by generating synthetic queries with large language models and pre-training retrieval models on them, which yields strong improvements in retrieval accuracy.
In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train two retrieval models, a bi-encoder and ColBERT, which are further fine-tuned using a contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvements in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the lack of large labeled datasets in the Vietnamese legal domain.
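To make the training objective in the abstract more concrete, the following is a minimal sketch of how a bi-encoder score, a ColBERT-style late-interaction score, and a contrastive loss over one positive passage and several hard negatives could be computed. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `temperature` parameter, the softmax (InfoNCE-style) form of the loss, and the toy vectors are all hypothetical, since the abstract does not specify these details.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bi_encoder_score(query_vec, passage_vec):
    """Bi-encoder relevance: one vector per text, compared by dot product."""
    return dot(query_vec, passage_vec)


def colbert_score(query_tokens, passage_tokens):
    """ColBERT-style late interaction (MaxSim): for each query token vector,
    take its best-matching passage token vector, then sum over query tokens."""
    return sum(max(dot(q, d) for d in passage_tokens) for q in query_tokens)


def contrastive_loss(pos_score, neg_scores, temperature=1.0):
    """Softmax contrastive loss (assumed InfoNCE-style form):
    -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) ).
    Computed with the log-sum-exp trick for numerical stability."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Toy example with hypothetical 2-dimensional embeddings.
query = [1.0, 0.0]
positive = [0.9, 0.1]                       # passage the query was generated from
hard_negatives = [[0.7, 0.3], [0.1, 0.9]]   # mined passages that look similar

pos = bi_encoder_score(query, positive)
negs = [bi_encoder_score(query, n) for n in hard_negatives]
loss = contrastive_loss(pos, negs)
print(f"positive score={pos:.2f}, loss={loss:.4f}")
```

The same `contrastive_loss` can be applied to `colbert_score` outputs, because the loss only depends on scalar relevance scores. Hard negatives matter here because a negative whose score is close to the positive's contributes most to the loss, so it provides a stronger training signal than a random passage.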