Learning language variations in news corpora through differential embeddings
This work addresses the need for NLP models to handle language variations, such as semantic drift and dialects, but it is incremental as it builds on existing dynamical embeddings with a specific architectural modification.
The authors tackled the problem of capturing language variations across time and regions by proposing a model with a central word representation and slice-dependent contributions, which they applied to The New York Times and The Guardian to capture temporal dynamics and US/UK English differences.
There is an increasing interest in the NLP community in capturing variations in the usage of language, either through time (i.e., semantic drift), across regions (as dialects or variants) or in different social contexts (i.e., professional or media technolects). Several successful dynamical embeddings have been proposed that can track semantic change through time. Here we show that a model with a central word representation and a slice-dependent contribution can learn word embeddings from different corpora simultaneously. This model is based on a star-like representation of the slices. We apply it to The New York Times and The Guardian newspapers, and we show that it can capture both temporal dynamics in the yearly slices of each corpus, and language variations between US and UK English in a curated multi-source corpus. We provide an extensive evaluation of this methodology.