CL, Nov 30, 2019

Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

arXiv:1912.00159v3998 citations
Originality: Incremental advance
AI Analysis

This work addresses the lack of resources for low-resource languages such as Swiss German, enabling better NLP applications. The advance is incremental because it adapts existing web scraping methods.

The paper tackles the problem of creating text corpora for low-resource languages. The authors build a customized web scraping tool and use it to generate SwissCrawl, the largest Swiss German corpus to date, with more than half a million sentences. Using the corpus leads to significant improvements in language modeling.

This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling. To capture new content, our approach will run continuously to keep increasing the corpus over time.

Code Implementations: 3 repos
