IR · AI · MM · Dec 29, 2022

BagFormer: Better Cross-Modal Retrieval via bag-wise interaction

arXiv:2212.14322v1 · 3 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses the high latency and low throughput of cross-modal retrieval systems and offers an efficient solution for applications that require real-time processing, although its contribution is an incremental improvement to dual encoder models.

The paper tackles the trade-off between performance and efficiency in cross-modal retrieval by introducing BagFormer, a dual encoder model that uses bag-wise interactions to achieve recall comparable to state-of-the-art single encoder models, with 20.72 times lower latency and 25.74 times higher throughput.
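The efficiency gap comes from where the expensive computation happens. A single encoder runs a joint forward pass for every (query, candidate) pair, while a dual encoder can embed candidates offline and only encode the query at search time. The toy cost model below illustrates this; the function names and millisecond figures are hypothetical and are not measurements from the paper.

```python
def single_encoder_latency_ms(num_candidates: int, joint_pass_ms: float) -> float:
    """A single (joint) encoder needs one full forward pass per (query, candidate) pair."""
    return num_candidates * joint_pass_ms


def dual_encoder_latency_ms(num_candidates: int, query_pass_ms: float, score_ms: float) -> float:
    """A dual encoder encodes the query once; candidate embeddings are precomputed
    offline, so only a cheap similarity score is computed per candidate."""
    return query_pass_ms + num_candidates * score_ms


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the trade-off.
    n = 1000
    single = single_encoder_latency_ms(n, joint_pass_ms=10.0)
    dual = dual_encoder_latency_ms(n, query_pass_ms=10.0, score_ms=0.1)
    print(f"single: {single} ms, dual: {dual} ms, speedup: {single / dual:.1f}x")
```

Because BagFormer is a dual encoder, the expectation is that its candidate representations can likewise be indexed ahead of time, which is where latency and throughput gains of the reported kind would come from.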

In the field of cross-modal retrieval, single encoder models tend to perform better than dual encoder models, but they suffer from high latency and low throughput. In this paper, we present a dual encoder model called BagFormer that uses a cross-modal interaction mechanism to improve recall without sacrificing latency or throughput. BagFormer achieves this through bag-wise interactions, which allow text to be represented at a more appropriate granularity and incorporate entity knowledge into the model. Our experiments demonstrate that BagFormer achieves results comparable to state-of-the-art single encoder models on cross-modal retrieval tasks, while also offering efficient training and inference with 20.72 times lower latency and 25.74 times higher throughput.
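The abstract does not spell out how bags are formed or scored. As a rough illustration only, the sketch below assumes that a bag is a contiguous span of text tokens (for example, an entity mention), that a bag's embedding is the mean of its token embeddings, and that each bag is matched against precomputed image patch embeddings by taking its maximum cosine similarity, with the per-bag maxima averaged into one score. This max-over-patches rule is borrowed from late-interaction models such as ColBERT and FILIP and is not confirmed to be BagFormer's scoring function; `pool_bags`, `bag_wise_score`, and `rank_images` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def pool_bags(token_embs: Sequence[Sequence[float]],
              bag_spans: Sequence[Tuple[int, int]]) -> List[Vector]:
    """Assumption: each bag is a [start, end) token span, mean-pooled then normalized."""
    bags = []
    for start, end in bag_spans:
        span = token_embs[start:end]
        dim = len(span[0])
        mean = [sum(tok[d] for tok in span) / len(span) for d in range(dim)]
        bags.append(normalize(mean))
    return bags


def bag_wise_score(bag_embs: Sequence[Vector],
                   patch_embs: Sequence[Sequence[float]]) -> float:
    """Assumption: each bag takes its best-matching image patch; scores are averaged over bags."""
    patches = [normalize(p) for p in patch_embs]
    return sum(max(dot(b, p) for p in patches) for b in bag_embs) / len(bag_embs)


def rank_images(bag_embs: Sequence[Vector],
                image_index: Dict[str, List[Vector]]) -> List[str]:
    """Rank images whose patch embeddings were computed offline (the dual encoder setting)."""
    scores = {name: bag_wise_score(bag_embs, patches) for name, patches in image_index.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings: three text tokens grouped into two bags.
    tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    bags = pool_bags(tokens, [(0, 2), (2, 3)])
    index = {
        "image_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both bags
        "image_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first bag
    }
    print(rank_images(bags, index))  # ['image_a', 'image_b']
```

In this sketch, grouping tokens into bags reduces the number of text units matched against each image, and the image side never sees the query, so its embeddings can be indexed before any search happens.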

