Harnessing the Universal Geometry of Embeddings
This work identifies a security vulnerability in vector databases: sensitive information about the underlying documents can be extracted from embedding vectors alone, which is a significant concern for data privacy and protection.
The paper tackles the problem of translating text embeddings between different vector spaces without paired data. Its translations achieve high cosine similarity across diverse models, and this capability poses a security risk for vector databases because it enables adversaries to extract sensitive information from embeddings.
We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.
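To make the evaluation metric and the threat model concrete, the following is a minimal sketch, not the paper's implementation. It assumes a translator has already mapped unknown embeddings into a known model's space; the translator itself is not shown. The function names (`cosine_similarity`, `mean_cosine`, `infer_attribute`), the 3-dimensional toy vectors, and the attribute labels are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cosine(translated, targets):
    """Average cosine similarity over paired translated/target embeddings.

    Pairs are used here only to score a translation after the fact;
    the abstract states that the method itself needs no paired data.
    """
    sims = [cosine_similarity(t, g) for t, g in zip(translated, targets)]
    return sum(sims) / len(sims)


def infer_attribute(translated_vec, label_embeddings):
    """Return the label whose embedding is closest to the translated vector.

    label_embeddings maps a label name to its embedding in the known
    model's space (hypothetical values in this sketch).
    """
    return max(
        label_embeddings,
        key=lambda name: cosine_similarity(translated_vec, label_embeddings[name]),
    )


# Toy data (assumed): outputs of a hypothetical translator and the
# embeddings the known model would have produced for the same documents.
translated = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical attribute labels embedded in the known model's space.
labels = {"medical": [1.0, 0.0, 0.0], "financial": [0.0, 1.0, 0.0]}

print("mean cosine similarity:", mean_cosine(translated, targets))
print("inferred attributes:", [infer_attribute(v, labels) for v in translated])
```

The second function illustrates why geometry-preserving translation matters for security: once a vector lands in a space the adversary can query, a nearest-label lookup may be enough for attribute inference, without ever seeing the original text.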