Japanese SimCSE Technical Report
This work addresses a gap in sentence embedding research for Japanese, providing a baseline for future studies, but it is incremental as it applies an existing method to a new language.
The researchers address the lack of Japanese sentence embedding models by fine-tuning pre-trained language models with SimCSE, and they report evaluation results for 24 models across five supervised and four unsupervised datasets.
We report the development of Japanese SimCSE, a set of Japanese sentence embedding models fine-tuned with SimCSE. Because Japanese lacks sentence embedding models that can serve as baselines in sentence embedding research, we conducted extensive experiments on Japanese sentence embeddings involving 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets. In this report, we provide the detailed training setup for the Japanese SimCSE models and their evaluation results.
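For readers unfamiliar with SimCSE, the contrastive objective it is generally described with can be sketched as follows. This is a minimal NumPy illustration, not the training code of this report: the function name `simcse_loss`, the toy inputs, and the temperature value of 0.05 are assumptions made for the example. In the unsupervised setting the two embeddings of a sentence typically come from two forward passes with different dropout masks, and in the supervised setting from a labeled sentence pair; here they are simply passed in as arrays.

```python
import numpy as np


def simcse_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive (InfoNCE) loss with in-batch negatives.

    z1[i] and z2[i] are two embeddings of the same sentence i (the positive
    pair); every z2[j] with j != i acts as a negative for z1[i].
    The temperature default is an assumption for this sketch.
    """
    # L2-normalize so the dot product below is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    # Pairwise cosine similarities scaled by temperature, shape (batch, batch).
    sim = (z1 @ z2.T) / temperature

    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


# Toy check: aligned pairs give a loss near zero, misaligned pairs a large one.
aligned = simcse_loss(np.eye(4), np.eye(4))
misaligned = simcse_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

The loss is small when each sentence is closest to its own second view and grows when the views are mismatched, which is the signal that fine-tuning exploits.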