CL · May 29, 2023

Representation Of Lexical Stylistic Features In Language Models' Embedding Space

arXiv:2305.18657v2225 citations
AI Analysis

This work addresses the problem of automatically characterizing text style for NLP applications. Its contribution is incremental, since it builds on existing embedding methods.

The paper demonstrates that lexical stylistic features like complexity, formality, and figurativeness can be identified in language model embedding spaces using small seed pairs, with experiments on five datasets showing static embeddings are more accurate for words/phrases while contextualized models perform better on sentences.

The representation space of pretrained Language Models (LMs) encodes rich information about words and their relationships (e.g., similarity, hypernymy, polysemy) as well as abstract semantic notions (e.g., intensity). In this paper, we demonstrate that lexical stylistic notions such as complexity, formality, and figurativeness, can also be identified in this space. We show that it is possible to derive a vector representation for each of these stylistic notions from only a small number of seed pairs. Using these vectors, we can characterize new texts in terms of these dimensions by performing simple calculations in the corresponding embedding space. We conduct experiments on five datasets and find that static embeddings encode these features more accurately at the level of words and phrases, whereas contextualized LMs perform better on sentences. The lower performance of contextualized representations at the word level is partially attributable to the anisotropy of their vector space, which can be corrected to some extent using techniques like standardization.
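The seed-pair idea described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact calculation, so the choices here are assumptions rather than the authors' method: the style vector is taken as the mean difference between the two members of each seed pair, a new text is scored by cosine similarity to that vector, and standardization is taken to mean per-dimension z-scoring. The function names and the toy 3-dimensional embeddings are hypothetical; in practice the vectors would come from a static or contextualized model.

```python
import numpy as np

def style_vector(seed_pairs):
    """Mean difference (high-style minus low-style) over seed pairs.

    Assumption: averaging difference vectors is one plausible way to
    derive a style direction; the abstract does not fix the formula.
    """
    diffs = [np.asarray(hi, dtype=float) - np.asarray(lo, dtype=float)
             for hi, lo in seed_pairs]
    return np.mean(diffs, axis=0)

def style_score(embedding, direction):
    """Cosine similarity between an embedding and the style direction."""
    embedding = np.asarray(embedding, dtype=float)
    denom = np.linalg.norm(embedding) * np.linalg.norm(direction)
    return float(embedding @ direction / denom) if denom else 0.0

def standardize(embeddings):
    """Z-score each dimension across a collection of embeddings.

    Assumption: this is one common form of standardization used to
    reduce anisotropy; constant dimensions are left at zero.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero on constant dimensions
    return (embeddings - mean) / std

# Toy, made-up embeddings: (formal, informal) seed pairs.
seed_pairs = [
    ([1.0, 0.2, 0.0], [-1.0, 0.2, 0.0]),
    ([0.8, -0.1, 0.3], [-0.6, -0.1, 0.3]),
]
formality = style_vector(seed_pairs)

formal_like = [0.5, 1.0, 0.0]
informal_like = [-0.5, 1.0, 0.0]
print(style_score(formal_like, formality))
print(style_score(informal_like, formality))
```

In this toy setup, a text whose embedding leans toward the formal members of the seed pairs receives a higher score than one leaning toward the informal members. Applying `standardize` to all embeddings before building the style vector is one way to test whether correcting anisotropy changes the ranking.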

