Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval
This work addresses scalable image retrieval from natural language descriptions for applications such as search engines and media archiving, and reports improved accuracy over prior baselines.
The paper tackles real-world image-text retrieval with a lightweight two-stage pipeline: event-centric entity extraction drives candidate filtering, and deep multimodal semantics drive reranking. It reports a mean average precision of 0.559 on the OpenEvents v1 benchmark, substantially outperforming prior baselines.
Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
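To make the filter-then-rerank structure concrete, the following is a minimal sketch and not the released implementation. It assumes a simple regex in place of the event-centric entity extractor, a from-scratch Okapi BM25 with common default parameters (k1 = 1.5, b = 0.75), and a caller-supplied `rerank_fn` in place of BEiT-3 scoring. All function names (`extract_entities`, `bm25_scores`, `retrieve`, `average_precision`) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def extract_entities(text):
    """Hypothetical stand-in for event-centric entity extraction:
    keeps capitalized words and standalone numbers (names, places, years)."""
    return [t.lower() for t in re.findall(r"\b(?:[A-Z][a-z]+|\d+)\b", text)]


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n or 1.0  # avoid division by zero
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, captions, rerank_fn, top_k=2):
    """Stage 1: BM25 over extracted entities keeps the top_k candidates.
    Stage 2: rerank_fn(query, index) reorders them. In this sketch rerank_fn
    is a placeholder for a vision-language scorer such as BEiT-3."""
    query_entities = extract_entities(query)
    docs = [extract_entities(c) for c in captions]
    scores = bm25_scores(query_entities, docs)
    order = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    candidates = order[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(query, i), reverse=True)


def average_precision(ranked, relevant):
    """Average precision of one ranked list; mAP is its mean over queries."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

Calling `retrieve(query, captions, rerank_fn)` returns caption indices in final ranked order. Averaging `average_precision` over all queries gives the mean average precision metric reported above. The cheap first stage limits how many candidates the more expensive reranker has to score, which is where the scalability of a two-stage design comes from.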