CL, Mar 7, 2019

Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain

arXiv:1903.02671v1, 4 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for domain-specific evaluation in digital humanities, but it is incremental: it adapts existing methods to new data.

The study applies word embedding technologies to the specialized digital humanities domain. It trains models on two fantasy novel series and evaluates them on term analogy and word intrusion tasks, finding that even embeddings trained on small corpora perform well, for example on word intrusion.

Word embeddings are already well studied in the general domain. They are usually trained on large text corpora and have been evaluated, for example, on word similarity and analogy tasks, but also as input to downstream NLP processes. In contrast, in this work we explore the suitability of word embedding technologies in the specialized digital humanities domain. After training embedding models of various types on two popular fantasy novel book series, we evaluate their performance on two task types: term analogies and word intrusion. To this end, we manually construct test datasets with domain experts. Among the contributions is the evaluation of various word embedding techniques on the different task types, with the finding that even embeddings trained on small corpora perform well, for example on the word intrusion task. Furthermore, we provide extensive and high-quality datasets in digital humanities for further investigation, as well as the implementation to easily reproduce or extend the experiments.
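To make the two task types concrete, here is a minimal sketch of how a term analogy and a word intrusion item could be scored against a set of word vectors. It is an illustration under stated assumptions, not the paper's implementation: the vectors are hand-made toy values rather than trained embeddings, the function names `solve_analogy` and `find_intruder` are hypothetical, and the scoring rules (vector offset with cosine similarity for analogies, lowest mean cosine similarity for intrusion) are common choices that the abstract does not confirm.

```python
import numpy as np

# Toy, hand-made vectors for illustration only (assumption: NOT taken
# from the paper's models or datasets).
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "sword": np.array([0.0, 0.0, 0.0, 1.0]),
}


def unit(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The candidate closest (by cosine) to b - a + c wins; the three
    input words are excluded from the candidates.
    """
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: float(unit(vectors[w]) @ target))


def find_intruder(vectors, words):
    """Return the word with the lowest mean cosine similarity to the rest."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        sims = [float(unit(vectors[word]) @ unit(vectors[o])) for o in others]
        return sum(sims) / len(sims)

    return min(words, key=mean_similarity)


if __name__ == "__main__":
    print(solve_analogy(toy_vectors, "man", "king", "woman"))
    print(find_intruder(toy_vectors, ["king", "queen", "man", "sword"]))
```

In an actual evaluation, each manually constructed test item would supply the expected answer, and accuracy would be the fraction of items for which the model's prediction matches it.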

