CL · LG · Oct 18, 2019

A Mutual Information Maximization Perspective of Language Representation Learning

arXiv:1910.08350v2 · 179 citations
Originality: Incremental advance
AI Analysis

The paper provides a theoretical unification of language representation learning methods, which may help transfer ideas between domains such as NLP and computer vision.

The paper shows that state-of-the-art word representation learning methods maximize a lower bound on mutual information between parts of sentences, unifying classical and modern embeddings, and introduces a new self-supervised objective based on this framework.

We show that state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspiration from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).
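
To make the "lower bound on mutual information" idea concrete, here is a minimal sketch of an InfoNCE-style contrastive bound, a commonly used estimator of this kind. It is an illustration under stated assumptions, not the paper's implementation: the function names (`info_nce`, `dot`), the toy vectors, and the choice of a dot-product scoring function are all hypothetical. Each sentence vector is paired with one "positive" n-gram vector, and the other n-grams in the batch act as negatives.

```python
import math

def info_nce(scores):
    """InfoNCE-style lower bound on mutual information.

    scores[i][j] is a score f(a_i, b_j); the diagonal entries scores[i][i]
    are the positive pairs and the rest of each row are negatives.
    Returns log(N) plus the mean log-softmax of the positives,
    so the value can never exceed log(N).
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in row))
        total += row[i] - log_norm
    return math.log(n) + total / n

def dot(u, v):
    """Dot-product scoring function (an assumption for this sketch)."""
    return sum(x * y for x, y in zip(u, v))

# Toy data (hypothetical): three "global sentence" vectors and the
# vector of one n-gram drawn from each corresponding sentence.
sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ngram_vecs = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]

scores = [[dot(s, g) for g in ngram_vecs] for s in sentence_vecs]
bound = info_nce(scores)
print(f"InfoNCE bound: {bound:.3f} (cannot exceed log N = {math.log(3):.3f})")
```

Training an encoder to increase this quantity pushes each sentence representation to score its own n-grams above those of other sentences. The bound is capped at log N, so larger sets of negatives allow a tighter estimate.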

