CL Feb 13, 2023

Evaluation of Word Embeddings for the Social Sciences

arXiv:2302.06174v1581 citations, h-index: 11
Originality: Synthesis-oriented
AI Analysis

This paper addresses a gap in NLP resources for social science researchers by providing domain-specific embeddings, though the contribution is incremental: it applies existing methods to new data.

The researchers addressed the lack of domain-specific word embeddings for the social sciences by creating and evaluating models trained on 37,604 open-access social science papers. They found that the domain-specific model covered a large part of social science concepts and provided more extensive coverage of semantic relationships than general language models.

Word embeddings are an essential instrument in many NLP tasks. Most available resources are trained on general language from Web corpora or Wikipedia dumps. However, word embeddings for domain-specific language are rare, in particular for the social science domain. Therefore, in this work, we describe the creation and evaluation of word embedding models based on 37,604 open-access social science research papers. In the evaluation, we compare domain-specific and general language models for (i) language coverage, (ii) diversity, and (iii) semantic relationships. We found that the created domain-specific model, even with a relatively small vocabulary size, covers a large part of social science concepts, and that the neighborhoods of these concepts are diverse in comparison to more general models. Across all relation types, we found a more extensive coverage of semantic relationships.
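
As a rough illustration of what a language-coverage comparison could look like, the sketch below computes the share of concept terms found in a model's vocabulary. It is a minimal sketch under stated assumptions, not the paper's evaluation procedure: the function name `vocabulary_coverage`, the exact-match rule, and the toy term lists are all hypothetical.

```python
def vocabulary_coverage(concepts, vocabulary):
    """Return the fraction of concept terms present in a model's vocabulary.

    Assumption: a concept counts as covered only on an exact,
    case-insensitive match. The paper may define coverage differently.
    """
    vocab = {word.lower() for word in vocabulary}
    terms = [concept.lower() for concept in concepts]
    if not terms:
        return 0.0
    covered = sum(1 for term in terms if term in vocab)
    return covered / len(terms)


# Toy, made-up data purely for illustration (not taken from the paper).
concepts = ["social capital", "migration", "inequality", "habitus"]
domain_vocab = {"migration", "inequality", "habitus", "survey"}
general_vocab = {"migration", "inequality", "football"}

domain_cov = vocabulary_coverage(concepts, domain_vocab)
general_cov = vocabulary_coverage(concepts, general_vocab)
print(f"domain-specific: {domain_cov:.2f}, general: {general_cov:.2f}")
```

With a real embedding model, the `vocabulary` argument would be the model's own list of known tokens. Multi-word concepts such as "social capital" would need a phrase-handling strategy that this sketch does not attempt.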

