CL · Feb 24

Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages

Felix Schneider, Maria Gogolev, Sven Sickert, Joachim Denzler

arXiv:2602.21377v1 · 0.6 · h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses limitations in NLP for under-resourced and morphologically rich languages. It offers a drop-in replacement for embeddings that could improve performance in models like BERT and word2vec. The advance appears incremental, since it is a hybrid method building on existing transformer and convolutional mechanisms.

The paper tackles the failure of tokenization-based models to capture orthographic similarities and morphological variations in low-resource and morphologically complex languages. It proposes Rich Character Embeddings (RCE), a transformer-based approach that computes word vectors directly from character strings, and reports that RCE outperforms traditional token-based approaches on limited data, as measured by the OddOneOut and TopK metrics.

Tokenization- and sub-tokenization-based models like word2vec, BERT and the GPTs are the state of the art in natural language processing. Typically, these approaches have limitations with respect to their input representation: they fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resourced languages. To mitigate this problem, we propose to compute word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. They have the potential to improve performance for both large context-based language models like BERT and small models like word2vec for under-resourced and morphologically rich languages. We evaluate our approach on various tasks such as SWAG, declension prediction for inflected languages, and metaphor and chiasmus detection for various languages. Our experiments show that it outperforms traditional token-based approaches on limited data using OddOneOut and TopK metrics.
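As a rough illustration of the idea of deriving word vectors from character strings rather than from a token dictionary, the sketch below uses plain character trigram counts as a stand-in. It is not the RCE transformer or the hybrid model from the paper. The odd-one-out and top-k functions show one common way such evaluations could be set up; the paper's exact metric definitions are not given here, and all function names (`char_ngram_vector`, `cosine`, `odd_one_out`, `top_k`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngram_vector(word, n=3):
    """Sparse vector of character n-gram counts (a stand-in, NOT the RCE model)."""
    padded = f"<{word}>"  # boundary markers so prefixes and suffixes are distinguishable
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def odd_one_out(words):
    """Return the word with the lowest mean similarity to the others (assumed metric setup)."""
    vectors = {w: char_ngram_vector(w) for w in words}

    def mean_similarity(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)

    return min(words, key=mean_similarity)


def top_k(query, vocabulary, k=2):
    """Return the k vocabulary words most similar to the query (assumed metric setup)."""
    q = char_ngram_vector(query)
    ranked = sorted(vocabulary, key=lambda w: cosine(q, char_ngram_vector(w)), reverse=True)
    return ranked[:k]
```

Because the vectors are built from characters, inflected forms such as "laufen", "laufe" and "laufend" end up close to each other without any of them needing a dictionary entry, which is the property the abstract attributes to character-level input representations.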
