CL | LG | Jul 23, 2020

Exploring Swedish & English fastText Embeddings for NER with the Transformer

arXiv:2007.16007v2 | 3 citations
AI Analysis

This work addresses the challenge of resource efficiency in NLP for languages like Swedish, though it is incremental in optimizing existing methods.

The paper addresses the problem of achieving good NLP performance with smaller datasets. It shows that embeddings trained on smaller corpora can outperform those trained on larger ones: on Swedish and English NER tasks, they perform better than the Common Crawl versions despite using less training data.

In this paper, our main contributions are showing that embeddings from relatively smaller corpora can outperform ones from larger corpora, and making a new Swedish analogy test set publicly available. To achieve good network performance in natural language processing (NLP) downstream tasks, several factors play important roles: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at both the intrinsic and extrinsic levels. The embeddings are deployed with the Transformer in a named entity recognition (NER) task, and significance tests are conducted. This is done for both Swedish and English. We obtain better performance in both languages on the downstream task with smaller training data, compared to the recently released Common Crawl versions; character n-grams also appear useful for Swedish, a morphologically rich language.
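To illustrate why character n-grams can help with a morphologically rich language, here is a minimal sketch of fastText-style subword extraction. The abstract does not state the n-gram range the authors used, so `min_n=3` and `max_n=6` are assumptions (they match the fastText library defaults), and the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, fastText-style.

    The word is wrapped in the boundary markers '<' and '>' so that
    prefixes and suffixes are distinguishable from word-internal n-grams.
    min_n and max_n are assumed values, not taken from the paper.
    """
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


# Two inflected forms of the same Swedish noun share many subword units,
# so a subword-based embedding for one form can inform the other.
base = set(char_ngrams("sjukhus"))       # "hospital"
definite = set(char_ngrams("sjukhuset"))  # "the hospital"
shared = base & definite
```

In fastText, a word vector is built from the vectors of these subword units (plus the whole word), so inflected or rare forms that share n-grams with frequent forms still receive informative representations.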

Code Implementations: 1 repo
