CLSep 3, 2019

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

arXiv:1909.01380v11099 citations
Originality Incremental advance
AI Analysis

This work provides incremental insights into representation learning in Transformers, addressing a foundational problem for researchers in natural language processing and deep learning.

The study investigated how token representations and feature spaces evolve across Transformer layers under different learning objectives, revealing that left-to-right language models gradually lose past information and form future predictions, while masked language models initially generalize token identity before recreating it at top layers.

We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We focus on the Transformers for our analysis as they have been shown effective on various tasks, including machine translation (MT), standard left-to-right language models (LM) and masked language modeling (MLM). Previous work used black-box probing tasks to show that the representations learned by the Transformer differ significantly depending on the objective. In this work, we use canonical correlation analysis and mutual information estimators to study how information flows across Transformer layers and how this process depends on the choice of learning objective. For example, as you go from bottom to top layers, information about the past in left-to-right language models gets vanished and predictions about the future get formed. In contrast, for MLM, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation. The token identity then gets recreated at the top MLM layers.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes