Heiko Schuldt

MM
9papers
204citations
Novelty17%
AI Score30

9 Papers

CLNov 25, 2025
Mirror, Mirror on the Wall -- Which is the Best Model of Them All?

Dina Sayed, Heiko Schuldt

Large Language Models (LLMs) have become one of the most transformative tools across many applications, as they have significantly boosted productivity and achieved impressive results in various domains such as finance, healthcare, education, telecommunications, and law, among others. Typically, state-of-the-art (SOTA) foundation models are developed by large corporations based on large data collections and substantial computational and financial resources required to pretrain such models from scratch. These foundation models then serve as the basis for further development and domain adaptation for specific use cases or tasks. However, given the dynamic and fast-paced nature of launching new foundation models, the process of selecting the most suitable model for a particular use case, application, or domain becomes increasingly complex. We argue that there are two main dimensions that need to be taken into consideration when selecting a model for further training: a qualitative dimension (which model is best suited for a task based on information, for instance, taken from model cards) and a quantitative dimension (which is the best performing model). The quantitative performance of models is assessed through leaderboards, which rank models based on standardized benchmarks and provide a consistent framework for comparing different LLMs. In this work, we address the analysis of the quantitative dimension by exploring the current leaderboards and benchmarks. To illustrate this analysis, we focus on the medical domain as a case study, demonstrating the evolution, current landscape, and practical significance of this quantitative evaluation dimension. Finally, we propose a Model Selection Methodology (MSM), a systematic approach designed to guide the navigation, prioritization, and selection of the model that best aligns with a given use case.

MMSep 27, 2019
Query by Semantic Sketch

Luca Rossetto, Ralph Gasser, Heiko Schuldt

Sketch-based query formulation is very common in image and video retrieval as these techniques often complement textual retrieval methods that are based on either manual or machine generated annotations. In this paper, we present a retrieval approach that allows to query visual media collections by sketching concept maps, thereby merging sketch-based retrieval with the search for semantic labels. Users can draw a spatial distribution of different concept labels, such as "sky", "sea" or "person" and then use these sketches to find images or video scenes that exhibit a similar distribution of these concepts. Hence, this approach does not only take the semantic concepts themselves into account, but also their semantic relations as well as their spatial context. The efficient vector representation enables efficient retrieval even in large multimedia collections. We have integrated the semantic sketch query mode into our retrieval engine vitrivr and demonstrated its effectiveness.

MMFeb 11, 2019
Towards an All-Purpose Content-Based Multimedia Information Retrieval System

Ralph Gasser, Luca Rossetto, Heiko Schuldt

The growth of multimedia collections - in terms of size, heterogeneity, and variety of media types - necessitates systems that are able to conjointly deal with several forms of media, especially when it comes to searching for particular objects. However, existing retrieval systems are organized in silos and treat different media types separately. As a consequence, retrieval across media types is either not supported at all or subject to major limitations. In this paper, we present vitrivr, a content-based multimedia information retrieval stack. As opposed to the keyword search approach implemented by most media management systems, vitrivr makes direct use of the object's content to facilitate different types of similarity search, such as Query-by-Example or Query-by-Sketch, for and, most importantly, across different media types - namely, images, audio, videos, and 3D models. Furthermore, we introduce a new web-based user interface that enables easy-to-use, multimodal retrieval from and browsing in mixed media collections. The effectiveness of vitrivr is shown on the basis of a user study that involves different query and media types. To the best of our knowledge, the full vitrivr stack is unique in that it is the first multimedia retrieval system that seamlessly integrates support for four different types of media. As such, it paves the way towards an all-purpose, content-based multimedia information retrieval system.

MMOct 10, 2018
V3C - a Research Video Collection

Luca Rossetto, Heiko Schuldt, George Awad et al.

With the widespread use of smartphones as recording devices and the massive growth in bandwidth, the number and volume of video collections has increased significantly in the last years. This poses novel challenges to the management of these large-scale video data and especially to the analysis of and retrieval from such video collections. At the same time, existing video datasets used for research and experimentation are either not large enough to represent current collections or do not reflect the properties of video commonly found on the Internet in terms of content, length, or resolution. In this paper, we introduce the Vimeo Creative Commons Collection, in short V3C, a collection of 28'450 videos (with overall length of about 3'800 hours) published under creative commons license on Vimeo. V3C comes with a shot segmentation for each video, together with the resulting keyframes in original as well as reduced resolution and additional metadata. It is intended to be used from 2019 at the International large-scale TREC Video Retrieval Evaluation campaign (TRECVid).

MMApr 13, 2018
The PS-Battles Dataset - an Image Collection for Image Manipulation Detection

Silvan Heller, Luca Rossetto, Heiko Schuldt

The boost of available digital media has led to a significant increase in derivative work. With tools for manipulating objects becoming more and more mature, it can be very difficult to determine whether one piece of media was derived from another one or tampered with. As derivations can be done with malicious intent, there is an urgent need for reliable and easily usable tampering detection methods. However, even media considered semantically untampered by humans might have already undergone compression steps or light post-processing, making automated detection of tampering susceptible to false positives. In this paper, we present the PS-Battles dataset which is gathered from a large community of image manipulation enthusiasts and provides a basis for media derivation and manipulation detection in the visual domain. The dataset consists of 102'028 images grouped into 11'142 subsets, each containing the original image as well as a varying number of manipulated derivatives.

HCJan 19, 2018
Proceedings of eNTERFACE 2015 Workshop on Intelligent Interfaces

Matei Mancas, Christian Frisson, Joëlle Tilmanne et al.

The 11th Summer Workshop on Multimodal Interfaces eNTERFACE 2015 was hosted by the Numediart Institute of Creative Technologies of the University of Mons from August 10th to September 2015. During the four weeks, students and researchers from all over the world came together in the Numediart Institute of the University of Mons to work on eight selected projects structured around intelligent interfaces. Eight projects were selected and their reports are shown here.

MMJul 5, 2017
Web Video in Numbers - An Analysis of Web-Video Metadata

Luca Rossetto, Heiko Schuldt

Web video is often used as a source of data in various fields of study. While specialized subsets of web video, mainly earmarked for dedicated purposes, are often analyzed in detail, there is little information available about the properties of web video as a whole. In this paper we present insights gained from the analysis of the metadata associated with more than 120 million videos harvested from two popular web video platforms, vimeo and YouTube, in 2016 and compare their properties with the ones found in commonly used video collections. This comparison has revealed that existing collections do not (or no longer) properly reflect the properties of web video "in the wild".