IR · Mar 7
Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis
Maria-Eirini Pegia, Dimitrios Stefanopoulos, Björn Þór Jónsson et al.
Text-to-video retrieval enables users to find relevant video content using natural language queries, a task that has grown increasingly important with the rapid expansion of online video. Over the past six years, research has produced numerous methods, such as dual encoders, attention-driven models, and multimodal fusion approaches; however, fundamental questions remain about model behavior, dataset influence, and query difficulty. In this work, we evaluate 14 state-of-the-art retrieval methods across 3 widely used datasets under a unified preprocessing and evaluation framework. We analyze caption characteristics, including length, clarity, semantic category, and Action vs. Scene balance, and link these to model performance. Our results show that short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging for all existing models. Attention-driven architectures better handle temporally dependent or multi-step queries, whereas dual-encoder and multimodal fusion models perform well primarily on simpler or single-category captions. Cross-dataset generalization improves with larger, more diverse caption sets, but generative captions do not consistently enhance retrieval accuracy. Overall, our findings highlight key dataset factors, benchmark challenges, and the interplay between query content and model architecture, providing guidance for developing more effective text-to-video retrieval systems.
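The recall figures mentioned in this abstract are usually reported as Recall@K. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes one ground-truth video per caption, placed on the diagonal of a caption-by-video similarity matrix; the function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of captions whose ground-truth video ranks in the top k.

    Assumption: similarity[i][j] scores caption i against video j, and
    video i is the single correct match for caption i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of videos scoring strictly higher than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores for three captions and three videos (illustrative values only).
sim = [
    [0.9, 0.1, 0.3],  # caption 0: correct video ranked first
    [0.2, 0.4, 0.8],  # caption 1: correct video ranked second
    [0.5, 0.6, 0.7],  # caption 2: correct video ranked first
]

print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

Grouping captions by length or semantic category before calling such a function is one way to relate caption characteristics to recall, as the abstract describes.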
MM · Mar 21, 2025
The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen et al.
Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.
MM · Apr 18, 2019
Exquisitor: Interactive Learning at Large
Björn Þór Jónsson, Omar Shahbaz Khan, Hanna Ragnarsdóttir et al.
Increasing scale is a dominant trend in today's multimedia collections, which especially impacts interactive applications. To facilitate interactive exploration of large multimedia collections, new approaches are needed that can learn new analytic categories on the fly, based on the visual and textual content. To support general use on standard desktops, laptops, and mobile devices, they must furthermore work with limited computing resources. We present Exquisitor, a highly scalable interactive learning approach, capable of intelligent exploration of the large-scale YFCC100M image collection with extremely efficient responses from the interactive classifier. Based on relevance feedback from the user on previously suggested items, Exquisitor uses semantic features, extracted from both visual and text attributes, to suggest relevant media items to the user. Exquisitor builds upon the state of the art in large-scale data representation, compression and indexing, introducing a cluster-based retrieval mechanism that enables efficient suggestions. With Exquisitor, each interaction round over the full YFCC100M collection is completed in less than 0.3 seconds using a single CPU core. That is 4x faster, using 16x fewer computational resources, than the most efficient state-of-the-art method, with a positive impact on result quality. These results open up many interesting research avenues, both for exploration of industry-scale media collections and for media exploration on mobile devices.
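A single interaction round of this kind, with relevance feedback followed by cluster-restricted suggestions, can be sketched as below. This is an illustrative toy under stated assumptions, not Exquisitor's implementation. The scoring rule (mean of positive features minus mean of negative features) is a simple swapped-in stand-in for the interactive classifier, and all names (`feedback_round`, `items`, `clusters`, `centroids`) and data are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def feedback_round(items, clusters, centroids, positives, negatives,
                   n_clusters=1, n_suggest=2):
    """Suggest unseen items after one round of user feedback.

    Assumption: a Rocchio-style direction (mean of positives minus mean
    of negatives) stands in for the learned classifier.
    """
    dim = len(next(iter(items.values())))
    direction = [0.0] * dim
    for ids, sign in ((positives, 1.0), (negatives, -1.0)):
        for item_id in ids:
            for d in range(dim):
                direction[d] += sign * items[item_id][d] / len(ids)

    # Cluster-based retrieval: score only the clusters whose centroids
    # align best with the direction, instead of the whole collection.
    best = sorted(centroids, key=lambda c: dot(direction, centroids[c]),
                  reverse=True)[:n_clusters]

    seen = set(positives) | set(negatives)
    candidates = [i for c in best for i in clusters[c] if i not in seen]
    candidates.sort(key=lambda i: dot(direction, items[i]), reverse=True)
    return candidates[:n_suggest]


# Toy collection: five items with 2-D features, grouped into two clusters.
items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.8, 0.3],
         "d": [0.0, 1.0], "e": [0.1, 0.9]}
clusters = {0: ["a", "b", "c"], 1: ["d", "e"]}
centroids = {0: [0.9, 0.13], 1: [0.05, 0.95]}

suggestions = feedback_round(items, clusters, centroids,
                             positives=["a"], negatives=["d"])
print(suggestions)
```

Restricting scoring to a few clusters is what keeps the per-round cost low in this sketch; how the actual system selects and scans clusters is described in the paper itself.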
DB · May 25, 2018
Dynamicity and Durability in Scalable Visual Instance Search
Herwig Lejsek, Björn Þór Jónsson, Laurent Amsaleg et al.
Visual instance search involves retrieving, from a collection of images, the ones that contain an instance of a visual query. Systems designed for visual instance search face the major challenge of scalability: a collection of a few million images used for instance search typically creates a few billion features that must be indexed. Furthermore, as real image collections grow rapidly, systems must also provide dynamicity, i.e., be able to handle on-line insertions while concurrently serving retrieval operations. Durability, which is the ability to recover correctly from software and hardware crashes, is the natural complement of dynamicity. Durability, however, has rarely been integrated within scalable and dynamic high-dimensional indexing solutions. This article addresses the issue of dynamicity and durability for scalable indexing of very large and rapidly growing collections of local features for instance retrieval. By extending the NV-tree, a scalable disk-based high-dimensional index, we show how to implement the ACID properties of transactions, which ensure both dynamicity and durability. We present a detailed performance evaluation of the transactional NV-tree: (i) we show that the insertion throughput is excellent despite the overhead of enforcing the ACID properties; (ii) we also show that this transactional index is truly scalable, using a standard image benchmark embedded in collections of up to 28.5 billion high-dimensional vectors, the largest single-server evaluations reported in the literature.