CL · LG · Oct 10, 2023

Text Embeddings Reveal (Almost) As Much As Text

John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush

arXiv:2310.06816v128.9230 citations · h-index: 69 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses privacy concerns for users of text embedding models by demonstrating significant information leakage, making it an incremental but impactful contribution to security and privacy in NLP.

The paper tackles the problem of reconstructing original text from dense text embeddings, revealing that a multi-step iterative method can recover 92% of 32-token text inputs exactly, and also recovers personal information like full names from clinical notes.

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when re-embedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on GitHub: https://github.com/jxmorris12/vec2text
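The "correct and re-embed" loop described in the abstract can be illustrated with a small toy sketch. This is not the paper's method or the vec2text API: the paper trains a model to propose corrections, whereas the sketch below swaps in an exhaustive one-token greedy search, and the encoder is a deliberately simple stand-in. All names here (`VOCAB`, `embed`, `sq_dist`, `invert`) are hypothetical and exist only for illustration.

```python
# Toy illustration of iterative embedding inversion.
# Assumption: a hand-made encoder and greedy search replace the learned
# models used in the paper; nothing here comes from the vec2text codebase.

VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat"]


def embed(tokens):
    """Toy encoder: the token at position p adds 2**-p to its own coordinate."""
    vec = [0.0] * len(VOCAB)
    for pos, tok in enumerate(tokens):
        vec[VOCAB.index(tok)] += 2.0 ** -pos
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def invert(target_emb, length, max_steps=20):
    """Repeatedly edit a hypothesis so its re-embedding moves toward target_emb."""
    hypothesis = [VOCAB[0]] * length  # uninformed initial guess
    best = sq_dist(embed(hypothesis), target_emb)
    history = [best]
    for _ in range(max_steps):
        candidate, cand_dist = None, best
        # Try every single-token edit, re-embed it, and keep the closest one.
        for pos in range(length):
            for tok in VOCAB:
                trial = hypothesis[:pos] + [tok] + hypothesis[pos + 1:]
                d = sq_dist(embed(trial), target_emb)
                if d < cand_dist:
                    candidate, cand_dist = trial, d
        if candidate is None:  # no edit gets closer: stop correcting
            break
        hypothesis, best = candidate, cand_dist
        history.append(best)
    return hypothesis, history


secret = ["cat", "sat", "mat"]
# The attacker sees only the embedding (and, in this sketch, the length).
recovered, history = invert(embed(secret), len(secret))
print(recovered, history)
```

In this toy setting the distance to the target embedding shrinks with every correction step until the hypothesis matches the hidden token sequence. Real encoders are far less forgiving, which is presumably why the paper relies on a trained correction model rather than brute-force search.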

