Tracing the Evolution of Word Embedding Techniques in Natural Language Processing

Minh Anh Nguyen, Kuheli Sai, Minh Nguyen

arXiv:2603.13271 | h-index: 2

AI Analysis

This work provides a comprehensive review and quantitative analysis of how representation learning has evolved in NLP, offering insights for researchers and practitioners in the field.

This study traces the evolution of word-embedding techniques in NLP by analyzing 149 articles published between 1954 and 2025. It reports a marked post-GPT-3 paradigm shift: contextual and sentence-level methods now dominate, at 6.4 times the odds of the pre-GPT-3 era, alongside significant growth in team sizes and the emergence of 30 new techniques.

This work traces the evolution of word-embedding techniques within the natural language processing (NLP) literature. We collect and analyze 149 research articles published between 1954 and 2025, providing both a comprehensive methodological review and a data-driven bibliometric analysis of how representation learning has developed over seven decades. Our study covers four major embedding paradigms: statistical representation-based methods (one-hot encoding, bag-of-words, TF-IDF), static word embeddings (Word2Vec, GloVe, FastText), contextual word embeddings (ELMo, BERT, GPT), and sentence/document embeddings. For each category, we critically discuss its strengths, its limitations, and the intellectual lineage connecting it to the others. Beyond the methodological survey, we conduct a formal era comparison using GPT-3's release as the dividing line, applying seven hypothesis tests to quantify shifts in research focus, collaboration patterns, and institutional involvement. Our analysis reveals a dramatic post-GPT-3 paradigm shift: contextual and sentence-level methods now dominate, at 6.4 times the odds of the pre-GPT-3 era; mean team sizes have grown significantly (p = 0.018); and 30 entirely new techniques have emerged, while 54 pre-GPT-3 methods received no further attention. These findings, combined with evidence of rising industry involvement, provide a quantitative account of how the field's epistemic priorities have been reshaped by the advent of large language models.
