CL · LG · Oct 10, 2023

Text Embeddings Reveal (Almost) As Much As Text

arXiv:2310.06816v1 · 224 citations · h-index: 69 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses privacy concerns for users of text embedding models by demonstrating significant information leakage, making it an incremental but impactful contribution to security and privacy in NLP.

The paper tackles the problem of reconstructing original text from dense text embeddings. It shows that a multi-step iterative method recovers 92% of 32-token text inputs exactly and can also recover personal information, such as full names, from clinical notes.

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on GitHub: https://github.com/jxmorris12/vec2text
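The correct-and-re-embed loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `embed` is a hypothetical stand-in built from fixed random token vectors rather than a real embedding model, and the correction step is a greedy single-token search, not the learned corrector the paper trains. It only shows the control flow: propose a hypothesis, re-embed it, and keep edits that move the re-embedded text closer to the target embedding.

```python
import math
import random

# Hypothetical toy vocabulary and embedding size (not from the paper).
VOCAB = ["the", "patient", "name", "is", "alice", "bob", "visited", "clinic"]
DIM = 16

# Stand-in embedder: one fixed random vector per token, seeded for repeatability.
_rng = random.Random(0)
_TOKEN_VECS = {tok: [_rng.gauss(0, 1) for _ in range(DIM)] for tok in VOCAB}


def embed(tokens):
    """Toy embedding: position-weighted sum of token vectors, L2-normalised."""
    vec = [0.0] * DIM
    for pos, tok in enumerate(tokens):
        weight = 1.0 + 0.5 * pos  # position weight makes the embedding order-sensitive
        for d in range(DIM):
            vec[d] += weight * _TOKEN_VECS[tok][d]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correct_once(hypothesis, target_embedding):
    """One correction round: keep single-token edits that reduce the distance."""
    best = list(hypothesis)
    best_dist = distance(embed(best), target_embedding)
    for pos in range(len(best)):
        for tok in VOCAB:
            candidate = best[:pos] + [tok] + best[pos + 1:]
            d = distance(embed(candidate), target_embedding)  # re-embed the guess
            if d < best_dist:
                best, best_dist = candidate, d
    return best, best_dist


def invert(target_embedding, length, max_steps=10):
    """Iteratively correct a naive initial guess until no edit helps."""
    hypothesis = [VOCAB[0]] * length
    history = [distance(embed(hypothesis), target_embedding)]
    for _ in range(max_steps):
        new_hypothesis, d = correct_once(hypothesis, target_embedding)
        if new_hypothesis == hypothesis:  # converged (possibly to a local optimum)
            break
        hypothesis = new_hypothesis
        history.append(d)
    return hypothesis, history


secret = ["patient", "name", "is", "alice"]
recovered, history = invert(embed(secret), len(secret))
print("recovered:", recovered)
print("exact match:", recovered == secret)
print("distance per step:", history)
```

The greedy search can stall in a local optimum, so exact recovery is not guaranteed even in this toy setting. The only property the sketch enforces is that the distance to the target embedding shrinks with every accepted correction.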

Code Implementations: 1 repo
