CL · Sep 12, 2024

Ruri: Japanese General Text Embeddings

arXiv:2409.07737v1 · 2 citations · h-index: 15
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of Japanese text embedding models for NLP applications, but its contribution is incremental: it adapts existing methods to a new language context.

The authors tackle the lack of Japanese general text embedding models by developing Ruri, a series of models trained on datasets synthesized by LLMs, and they evaluate the resulting models on benchmarks.

We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthesized datasets generated by LLMs, the construction of the reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
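
The abstract mentions using a reranker to filter synthesized data. As a rough illustration of that general idea only, and not the authors' implementation, the sketch below keeps (query, passage) pairs whose relevance score clears a threshold. The names `toy_score` and `filter_pairs`, the threshold value, and the token-overlap scorer are all hypothetical; a real pipeline would plug a trained reranker in as `score_fn`, and whitespace tokenization would not suit Japanese text.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (query, passage)


def toy_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a reranker.

    Returns the fraction of whitespace-separated query tokens
    that also appear in the passage (0.0 for an empty query).
    """
    q_tokens = query.lower().split()
    if not q_tokens:
        return 0.0
    p_tokens = set(passage.lower().split())
    return sum(t in p_tokens for t in q_tokens) / len(q_tokens)


def filter_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed value, chosen only for illustration
) -> List[Pair]:
    """Keep only the pairs whose score is at least the threshold."""
    return [(q, p) for q, p in pairs if score_fn(q, p) >= threshold]


# Two synthetic pairs: one relevant, one unrelated.
pairs = [
    ("capital of japan", "tokyo is the capital of japan"),
    ("capital of japan", "ruri is a blue gemstone"),
]
kept = filter_pairs(pairs, toy_score, threshold=0.5)
```

Passing the scorer as a function keeps the filtering step independent of whichever model produces the scores.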

Code Implementations: 1 repo