CV | Oct 6, 2025

Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy"

arXiv:2510.05411v1 | 3.6 | h-index: 11 | CBMI

Originality: Incremental advance

AI Analysis

The paper addresses personalized image retrieval for users who need to find specific objects in varied contexts, and it represents an incremental advance.

The paper tackles image retrieval with compound queries that combine object instance information from an image with a natural text description, such as finding an image of a specific unicorn on someone's head. It shows that the approach improves the state of the art on two personalized retrieval benchmarks.

The goal of this paper is to retrieve images using a compound query that combines object instance information from an image with a natural text description of what that object is doing or where it is. An example is retrieving an image of "Fluffy the unicorn (specified by an image) on someone's head". To achieve this we design a mapping network that can "translate" from a local image embedding (of the object instance) to a text token, such that the combination of the token and a natural language query is suitable for CLIP-style text encoding and image retrieval. Generating a text token in this manner involves a simple training procedure that only needs to be performed once for each object instance. We show that our approach of using a trainable mapping network, termed pi-map, together with frozen CLIP text and image encoders, improves the state of the art on two benchmarks designed to assess personalized retrieval.
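
The data flow described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the linear `map_to_token` function, the mean-pooling `encode_text` stand-in for a frozen text encoder, the toy vocabulary, the `*` placeholder, and the 8-dimensional embeddings are all assumptions made for illustration, and no training is shown. The sketch only shows how a token derived from an object embedding could be spliced into a text query and used to rank a gallery by cosine similarity.

```python
import math
import random

DIM = 8  # toy embedding width (assumption; real CLIP embeddings are far larger)
rng = random.Random(0)


def rand_vec(dim=DIM):
    """Random vector standing in for a learned embedding."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Hypothetical token-embedding table of a frozen text encoder.
VOCAB = {w: rand_vec() for w in ["a", "photo", "of", "on", "head", "beach"]}


def map_to_token(obj_embedding, weights):
    """Hypothetical linear mapping network: object embedding -> pseudo-token embedding."""
    return [sum(w * x for w, x in zip(row, obj_embedding)) for row in weights]


def encode_text(words, pseudo_token):
    """Toy stand-in for a frozen text encoder: mean-pool the token embeddings.

    The placeholder word "*" is replaced by the pseudo-token for the object instance.
    """
    vecs = [pseudo_token if w == "*" else VOCAB[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query_vec, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda k: cosine(query_vec, gallery[k]), reverse=True)


# Demo with an untrained (identity) mapping, purely to exercise the pipeline.
identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
fluffy_embedding = rand_vec()  # would come from a frozen image encoder
fluffy_token = map_to_token(fluffy_embedding, identity)
query = encode_text(["a", "photo", "of", "*", "on", "head"], fluffy_token)

# Gallery of image embeddings; "target" is built to align with the query.
gallery = {
    "target": [2.0 * x for x in query],
    "distractor_1": rand_vec(),
    "distractor_2": rand_vec(),
}
ranked = retrieve(query, gallery)
print(ranked)
```

In this sketch the weights of `map_to_token` would be the only trainable part, fitted once per object instance, while the encoders stay fixed, mirroring the split between the trainable mapping network and the frozen CLIP encoders that the abstract describes.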
