CL | Jul 28, 2024

Open Sentence Embeddings for Portuguese with the Serafim PT* encoders family

arXiv:2407.19527v1 | 9 citations | h-index: 6 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides accessible, high-performance sentence embeddings for Portuguese speakers and researchers, though the contribution is incremental: it adapts existing methods to a new language.

The authors tackled the lack of open-source sentence encoders for Portuguese by developing the Serafim PT* family, which achieves state-of-the-art performance across various model sizes and is made available under a permissive license.

Sentence encoders encode the semantics of their input, enabling key downstream applications such as classification, clustering, or retrieval. In this paper, we present Serafim PT*, a family of open-source sentence encoders for Portuguese with various sizes, suited to different hardware/compute budgets. Each model exhibits state-of-the-art performance and is made openly available under a permissive license, allowing its use for both commercial and research purposes. Besides the sentence encoders, this paper contributes a systematic study and lessons learned concerning the selection criteria of learning objectives and parameters that support top-performing encoders.
