Cross-lingual paraphrase identification
This work addresses multilingual semantic similarity for NLP applications, but its contribution is incremental: it builds on existing methods with minor improvements.
The paper tackles cross-lingual paraphrase identification by training a bi-encoder model contrastively, achieving performance comparable to state-of-the-art cross-encoders with only a 7-10% relative drop on the chosen dataset.
The paraphrase identification task involves measuring semantic similarity between two short sentences. The task is difficult, and its multilingual variant is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. Because a bi-encoder encodes each sentence independently, the embeddings it produces can be reused for other tasks, such as semantic search. We evaluate our model on downstream tasks and also assess the quality of the embedding space. Our performance is comparable to state-of-the-art cross-encoders, with a relative drop of only 7-10% on the chosen dataset, while maintaining decent embedding quality.
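
To make the contrastive bi-encoder setup concrete, the following is a minimal sketch, not the paper's implementation. It assumes an InfoNCE-style in-batch loss over cosine similarities, which is one common way to train bi-encoders contrastively; the abstract does not state which loss is used. The function names, the temperature of 0.05, the decision threshold of 0.8, and the toy two-dimensional vectors standing in for encoder outputs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss (an assumed choice, not confirmed by the source).

    positives[i] is the paraphrase of anchors[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denom = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_denom - logits[i]  # negative log-softmax of the true pair
    return total / len(anchors)


def is_paraphrase(u, v, threshold=0.8):
    """Label a pair as a paraphrase when similarity reaches a (hypothetical) threshold."""
    return cosine(u, v) >= threshold


# Toy embeddings standing in for the outputs of the sentence encoder.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each anchor is close to its own paraphrase
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # each anchor is close to the wrong sentence

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_shuffled = in_batch_contrastive_loss(anchors, shuffled)
```

In this sketch the loss is low when each sentence sits closest to its own paraphrase and high when the pairs are mismatched. Since each sentence is embedded on its own, the same vectors can also be indexed once and compared by cosine similarity at query time, which is what makes semantic search practical with a bi-encoder.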