CL · LG · Jul 24, 2025

A New Pair of GloVes

arXiv:2507.18103v1 · 4 citations · h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work provides updated GloVe word embeddings for NLP practitioners. The contribution is incremental: it primarily retrains existing models on newer data and documents them more carefully.

The authors address outdated and poorly documented GloVe word embeddings by training new 2024 models on updated data sources: Wikipedia, Gigaword, and a subset of Dolma. The resulting vectors incorporate new culturally relevant words and perform better on recent, temporally dependent NER datasets, such as non-Western newswire data.

This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.
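The kinds of evaluation named in the abstract (vocabulary comparison, similarity, and analogy) can be sketched in a few lines. The following is a minimal illustration, not the authors' evaluation code. It assumes vectors in GloVe's plain-text format (one token per line followed by its float components), uses tiny hand-made 3-dimensional toy vectors instead of the real models, and solves analogies with the common vector-offset (3CosAdd) method, which the abstract does not specify. The token "newterm" and all function names are hypothetical.

```python
import io

import numpy as np


def load_glove_text(handle):
    """Parse GloVe-style text: each line is a token followed by its floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors


def new_words(old_vectors, new_vectors):
    """Vocabulary comparison: tokens present only in the newer model."""
    return sorted(set(new_vectors) - set(old_vectors))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(vectors, a, b, c):
    """Vector-offset analogy: a is to b as c is to ?, excluding query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy stand-ins for an older and a newer embedding file (values are invented).
OLD_TOY = (
    "king 0.9 0.8 0.1\n"
    "queen 0.9 0.1 0.8\n"
    "man 0.1 0.9 0.1\n"
    "woman 0.1 0.1 0.9\n"
)
NEW_TOY = OLD_TOY + "newterm 0.5 0.5 0.5\n"  # hypothetical newly added token

old = load_glove_text(io.StringIO(OLD_TOY))
new = load_glove_text(io.StringIO(NEW_TOY))

print("only in new model:", new_words(old, new))
print("man : king :: woman :", analogy(new, "man", "king", "woman"))
print("cos(king, queen):", round(cosine(new["king"], new["queen"]), 3))
```

With real models, the `io.StringIO` objects would be replaced by opened vector files, and the analogy and similarity scores would be aggregated over a benchmark set rather than a single query.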

