Disentangling continuous and discrete linguistic signals in transformer-based sentence embeddings
This work addresses the challenge of interpreting distributed text embeddings for linguistic analysis, but its contribution is incremental: it targets specific phenomena (subject-verb agreement and verb alternations) and does not demonstrate broader applicability.
The authors tackle the problem of disentangling continuous and discrete linguistic signals in transformer-based sentence embeddings. They show that a latent layer with both discrete and continuous components captures targeted phenomena, such as subject-verb agreement and verb alternations, better than a latent layer with only discrete or only continuous components.
Sentence and word embeddings encode structural and semantic information in a distributed manner. Part of the encoded information -- particularly lexical information -- can be seen as continuous, whereas other information -- such as structural information -- is most often discrete. We explore whether we can compress transformer-based sentence embeddings into a representation that separates different linguistic signals -- in particular, information relevant to subject-verb agreement and verb alternations. We show that by compressing an input sequence that shares a targeted phenomenon into the latent layer of a variational autoencoder-like system, the targeted linguistic information becomes more explicit. A latent layer with both discrete and continuous components captures the targeted phenomena better than a latent layer with only discrete or only continuous components. These experiments are a step towards separating linguistic signals from distributed text embeddings and linking them to more symbolic representations.
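
To make the idea of a latent layer with both discrete and continuous components concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the continuous part is sampled with the standard Gaussian reparameterization and that the discrete part is relaxed with a Gumbel-softmax; the abstract does not specify either choice. The function names, the temperature, and the toy dimensions are all hypothetical.

```python
import math
import random


def sample_continuous(mu, log_var, rng):
    """Reparameterized Gaussian sample: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sample_discrete(logits, temperature, rng):
    """Gumbel-softmax relaxation of a categorical sample (an assumption here)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_latent(mu, log_var, logits, temperature=0.5, seed=0):
    """Concatenate a relaxed one-hot (discrete) part and a Gaussian (continuous) part."""
    rng = random.Random(seed)
    z_cont = sample_continuous(mu, log_var, rng)
    z_disc = sample_discrete(logits, temperature, rng)
    return z_disc + z_cont


# Toy usage: 3 discrete categories and 2 continuous dimensions.
z = mixed_latent(mu=[0.1, -0.3], log_var=[-2.0, -2.0], logits=[2.0, 0.5, -1.0])
print(len(z))  # 5 values: 3 discrete (summing to 1) followed by 2 continuous
```

In an encoder-decoder system of this kind, `mu`, `log_var`, and `logits` would be produced by an encoder from the sentence embedding, and the concatenated vector would be passed to a decoder. Lowering the temperature pushes the discrete part closer to a one-hot vector.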