CL May 19, 2023

Contextualized Word Vector-based Methods for Discovering Semantic Differences with No Training nor Word Alignment

arXiv:2305.11516v1
Originality: Incremental advance
AI Analysis

The paper addresses semantic analysis in computational linguistics, offering researchers and practitioners a simpler method that drops the assumptions earlier approaches make about the words and corpora being compared. The advance is incremental, since it builds on existing contextualized word vector techniques.

The paper tackles the problem of discovering semantic differences between words across two corpora without training or word alignment. It uses the norm of each word's mean contextualized vector as a reflection of meaning coverage. On native/non-native and historical English corpora, the methods are shown to be robust to skew in corpus size, to detect differences in infrequent words, and to pinpoint word instances whose meaning is missing from one of the corpora.

In this paper, we propose methods for discovering semantic differences in words appearing in two corpora based on the norms of contextualized word vectors. The key idea is that the coverage of a word's meanings is reflected in the norm of its mean word vector. The proposed methods do not require the assumptions concerning words and corpora for comparison that the previous methods do. All they require is computing the mean vector of the contextualized word vectors, and its norm, for each word type. Nevertheless, they are (i) robust to skew in corpus size; (ii) capable of detecting semantic differences in infrequent words; and (iii) effective in pinpointing word instances that have a meaning missing in one of the two corpora for comparison. We show these advantages for native and non-native English corpora and also for historical corpora.
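
The computation the abstract describes (a mean vector and its norm per word type) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `mean_vector_norm` and `norm_differences` are hypothetical, the vectors are assumed to be non-zero and are scaled to unit length before averaging (an assumption not stated in the abstract), and the inputs are assumed to be dictionaries mapping each word type to the contextualized vectors of its occurrences, already extracted from some encoder. With unit-length inputs, the norm of the mean is at most 1 and shrinks as the vectors point in more varied directions, which is one way the spread of usages can show up in a single number.

```python
import numpy as np


def mean_vector_norm(vectors):
    """Norm of the mean of the unit-normalized row vectors.

    Assumes every row is non-zero. Unit normalization is an
    assumption of this sketch, not something the abstract states.
    """
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.linalg.norm(unit.mean(axis=0)))


def norm_differences(corpus_a, corpus_b):
    """Difference in mean-vector norm (A minus B) for shared word types.

    Each argument maps a word type to a list of its contextualized
    vectors in that corpus (hypothetical input format).
    """
    shared = corpus_a.keys() & corpus_b.keys()
    return {
        w: mean_vector_norm(corpus_a[w]) - mean_vector_norm(corpus_b[w])
        for w in shared
    }


# Toy 2-d vectors standing in for real contextualized embeddings.
corpus_a = {"bank": [[1.0, 0.0], [0.0, 1.0]]}   # two dissimilar usages
corpus_b = {"bank": [[1.0, 0.0], [2.0, 0.0]]}   # one usage direction only
diffs = norm_differences(corpus_a, corpus_b)
```

In this toy case the word has a lower mean-vector norm in `corpus_a`, where its vectors point in different directions, than in `corpus_b`, where they all point the same way, so `diffs["bank"]` is negative. No training or alignment step is involved; each corpus is summarized independently and only the per-word norms are compared.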

