Ariel Ekgren

h-index: 4

4 papers

1,710 citations

Novelty: 19%

AI Score: 21

Ranked #182,270 of 194,257 authors (top 94%); #29,718 in CL (top 97%)

4 Papers

2.1 · CL · Mar 30, 2023

The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

Joey Öhman, Severine Verlinden, Ariel Ekgren et al.

Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as the Nordic ones, where the availability of text corpora is limited. To facilitate the development of LLMs for the Nordic languages, we curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.

4.3 · CL · May 22, 2023

GPT-SW3: An Autoregressive Language Model for the Nordic Languages

Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk et al.

This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, through training configuration and instruction finetuning, to evaluation and considerations for release strategies. We hope that this paper can serve as a guide and reference for other researchers who undertake the development of large generative models for smaller languages.

29.6 · CL · Sep 15, 2021

Cross-lingual Transfer of Monolingual Models

Evangelia Gogoulou, Ariel Ekgren, Tim Isbister et al.

Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform the native English model independently of the source language. After probing the English linguistic knowledge encoded in the representations before and after transfer, we find that semantic information is retained from the source language, while syntactic information is learned during transfer. Additionally, the results of evaluating the transferred models in source language tasks reveal that their performance in the source domain deteriorates after transfer.

31.9 · CL · Aug 14, 2018

R-grams: Unsupervised Learning of Semantic Units in Natural Language

Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren

This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding techniques. In contrast to previous work, which has primarily focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. The proposed approach is evaluated by demonstrating its viability in embedding techniques, in both monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.