CL, Jun 18, 2020

AMALGUM -- A Free, Balanced, Multilayer English Web Corpus

arXiv:2006.10677v1997 citations
Originality Synthesis-oriented
AI Analysis

AMALGUM gives NLP researchers a sizable, genre-balanced alternative to smaller manually annotated datasets and avoids licensing problems and imbalanced or unknown composition, though its contribution is incremental in that it relies on existing annotation methods.

The authors address the lack of freely available, balanced, annotated English web corpora by creating AMALGUM, a 4M-token corpus with automatic annotation layers that include dependency trees, coreference resolution, and discourse trees. They combine information from multiple annotation layers to reach what they call a 'better than NLP' benchmark, and they evaluate the accuracy of the resulting resource.

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a "better than NLP" benchmark and evaluate the accuracy of the resulting resource.

Code Implementations: 1 repo
