CL, Nov 1, 2024

Generic Embedding-Based Lexicons for Transparent and Reproducible Text Scoring

arXiv:2411.00964v1
Originality: Incremental advance
AI Analysis

The paper addresses researchers' need for reproducible and efficient text scoring tools. The advance is incremental because it builds on existing embedding methods.

The paper addresses the trade-off between high-performance but opaque text analysis models and transparent but limited manual lexicons. It proposes lexicons built from pretrained word embeddings such as FastText and GloVe, which the author argues yield text scoring tools that are both transparent and high-performing.

With text analysis tools becoming increasingly sophisticated over the last decade, researchers now face a decision of whether to use state-of-the-art models that provide high performance but that can be highly opaque in their operations and computationally intensive to run. The alternative, frequently, is to rely on older, manually crafted textual scoring tools that are transparently and easily applied, but can suffer from limited performance. I present an alternative that combines the strengths of both: lexicons created with minimal researcher inputs from generic (pretrained) word embeddings. Presenting a number of conceptual lexicons produced from FastText and GloVe (6B) vector representations of words, I argue that embedding-based lexicons respond to a need for transparent yet high-performance text measuring tools.
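
To make the idea of an embedding-based lexicon concrete, here is a minimal sketch. The abstract does not spell out the construction procedure, so everything below is an assumption for illustration: a small set of researcher-chosen seed words, word weights given by cosine similarity to the mean seed vector, and a text score that averages those weights. The three-dimensional vectors are toy values standing in for real FastText or GloVe vectors, and the names `build_lexicon` and `score_text` are hypothetical.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for pretrained embeddings.
# These values are made up for illustration; real FastText or GloVe
# vectors have hundreds of dimensions and are loaded from disk.
EMBEDDINGS = {
    "happy": np.array([0.9, 0.1, 0.0]),
    "joyful": np.array([0.8, 0.2, 0.1]),
    "glad": np.array([0.85, 0.15, 0.05]),
    "sad": np.array([-0.8, 0.1, 0.1]),
    "table": np.array([0.0, 0.1, 0.9]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def build_lexicon(seeds, embeddings):
    """Weight every vocabulary word by cosine similarity to the mean seed vector.

    Assumed procedure, not necessarily the one used in the paper.
    """
    centroid = unit(np.mean([unit(embeddings[s]) for s in seeds], axis=0))
    return {word: float(unit(vec) @ centroid) for word, vec in embeddings.items()}


def score_text(text, lexicon):
    """Average the lexicon weights of the tokens found in the lexicon."""
    weights = [lexicon[tok] for tok in text.lower().split() if tok in lexicon]
    return sum(weights) / len(weights) if weights else 0.0


lexicon = build_lexicon(["happy", "joyful"], EMBEDDINGS)
positive_score = score_text("I am glad and happy", lexicon)
negative_score = score_text("a sad table", lexicon)
```

Because the resulting lexicon is just a word-to-weight table, it can be saved, inspected, and reapplied without rerunning any model, which is the kind of transparency and reproducibility the abstract emphasizes.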

