Anthony Ventresque

SE
h-index: 20
10 papers
223 citations
Novelty: 36%
AI Score: 45

10 Papers

CV · Jun 30, 2023
Towards the extraction of robust sign embeddings for low resource sign language recognition

Mathieu De Coster, Ellen Rushe, Ruth Holmes et al.

Isolated Sign Language Recognition (SLR) has mostly been applied on datasets containing signs executed slowly and clearly by a limited group of signers. In real-world scenarios, however, we are met with challenging visual conditions, coarticulated signing, small datasets, and the need for signer independent models. To tackle this difficult problem, we require a robust feature extractor to process the sign language videos. One could expect human pose estimators to be ideal candidates. However, due to a domain mismatch with their training sets and challenging poses in sign language, they lack robustness on sign language data and image-based models often still outperform keypoint-based models. Furthermore, whereas the common practice of transfer learning with image-based models yields even higher accuracy, keypoint-based models are typically trained from scratch on every SLR dataset. These factors limit their usefulness for SLR. From the existing literature, it is also not clear which, if any, pose estimator performs best for SLR. We compare the three most popular pose estimators for SLR: OpenPose, MMPose and MediaPipe. We show that through keypoint normalization, missing keypoint imputation, and learning a pose embedding, we can obtain significantly better results and enable transfer learning. We show that keypoint-based embeddings contain cross-lingual features: they can transfer between sign languages and achieve competitive performance even when fine-tuning only the classifier layer of an SLR model on a target sign language. We furthermore achieve better performance using fine-tuned transferred embeddings than models trained only on the target sign language. The embeddings can also be learned in a multilingual fashion. The application of these embeddings could prove particularly useful for low resource sign languages in the future.
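
The abstract names keypoint normalization and missing keypoint imputation without giving the exact procedure. The following is a minimal sketch of what such preprocessing could look like, not the authors' method. It assumes linear interpolation over time for imputation and shoulder-based centring and scaling for normalization. The function names and the shoulder indices are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Optional[Tuple[float, float]]  # None marks a keypoint the estimator missed


def impute_missing(track: List[Point]) -> List[Point]:
    """Fill gaps in one keypoint's trajectory by linear interpolation over time.

    Leading and trailing gaps take the nearest observed value.
    A track with no observations at all is returned unchanged.
    """
    known = [i for i, p in enumerate(track) if p is not None]
    if not known:
        return list(track)
    filled = list(track)
    for i, point in enumerate(track):
        if point is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled[i] = track[nxt]
        elif nxt is None:
            filled[i] = track[prev]
        else:
            t = (i - prev) / (nxt - prev)
            (x0, y0), (x1, y1) = track[prev], track[nxt]
            filled[i] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return filled


def normalize_frame(
    points: List[Tuple[float, float]],
    left_shoulder: int = 0,   # hypothetical index of the left shoulder keypoint
    right_shoulder: int = 1,  # hypothetical index of the right shoulder keypoint
) -> List[Tuple[float, float]]:
    """Centre keypoints on the shoulder midpoint and scale by shoulder width."""
    (lx, ly), (rx, ry) = points[left_shoulder], points[right_shoulder]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    scale = math.hypot(rx - lx, ry - ly) or 1.0  # guard against zero width
    return [((x - cx) / scale, (y - cy) / scale) for x, y in points]
```

Normalizing relative to the signer's own body is one way to reduce dependence on camera distance and signer position, which is the kind of invariance a signer-independent model needs.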

63.6 · AI · Mar 12 · Code
TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Mayug Maniparambil, Nils Hoehing, Janak Kapuriya et al.

Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain-of-thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns, such as premature commitment and constraint forgetting, have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally, we study mitigation strategies including prompt guidance, cell-aligned grid representations and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.

CV · Nov 20, 2023
What's left can't be right -- The remaining positional incompetence of contrastive vision-language models

Nils Hoehing, Ellen Rushe, Anthony Ventresque

Contrastive vision-language models like CLIP have been found to lack spatial understanding capabilities. In this paper we discuss the possible causes of this phenomenon by analysing both datasets and embedding space. By focusing on simple left-right positional relations, we show that this behaviour is entirely predictable, even with large-scale datasets, demonstrate that these relations can be taught using synthetic data and show that this approach can generalise well to natural images - improving the performance on left-right relations on Visual Genome Relations.

CV · Sep 2, 2025 · Code
Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks

Nils Hoehing, Mayug Maniparambil, Ellen Rushe et al.

We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It comprises entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open-source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: https://github.com/nilshoehing/rocketscience

SE · Feb 13, 2025
Metamorphic Testing for Pose Estimation Systems

Matias Duran, Thomas Laurent, Ellen Rushe et al.

Pose estimation systems are used in a variety of fields, from sports analytics to livestock care. Given their potential impact, it is paramount to systematically test their behaviour and potential for failure. This is a complex task due to the oracle problem and the high cost of the manual labelling necessary to build ground truth keypoints. This problem is exacerbated by the fact that different applications require systems to focus on different subjects (e.g., human versus animal) or landmarks (e.g., only extremities versus whole body and face), which makes labelled test data rarely reusable. To combat these problems we propose MET-POSE, a metamorphic testing framework for pose estimation systems that bypasses the need for manual annotation while assessing the performance of these systems under different circumstances. MET-POSE thus allows users of pose estimation systems to assess the systems in conditions that more closely relate to their application without having to label an ad-hoc test dataset or rely only on available datasets, which may not be adapted to their application domain. While we define MET-POSE in general terms, we also present a non-exhaustive list of metamorphic rules that represent common challenges in computer vision applications, as well as a specific way to evaluate these rules. We then experimentally show the effectiveness of MET-POSE by applying it to MediaPipe Holistic, a state-of-the-art human pose estimation system, with the FLIC and PHOENIX datasets. With these experiments, we outline numerous ways in which the outputs of MET-POSE can uncover faults in pose estimation systems at a similar or higher rate than classic testing using hand-labelled data, and show that users can tailor the rule set they use to the faults and level of accuracy relevant to their application.
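
To illustrate the general idea of a metamorphic rule, here is a small sketch of a horizontal-flip check. It is not taken from MET-POSE and does not use its API. The rule assumed here is that flipping the input image should flip the predicted keypoint accordingly, so no ground truth label is needed. The toy estimators and all names are hypothetical stand-ins for a real pose estimator.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]        # grayscale image as rows of pixel intensities
Keypoint = Tuple[int, int]     # (column, row)
Estimator = Callable[[Image], Keypoint]


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_keypoint(keypoint: Keypoint, width: int) -> Keypoint:
    """Mirror a keypoint the same way the image is mirrored."""
    x, y = keypoint
    return (width - 1 - x, y)


def violates_flip_rule(estimator: Estimator, image: Image, tolerance: int = 0) -> bool:
    """Return True if the prediction on the flipped image disagrees with the
    flipped prediction on the original image by more than `tolerance` pixels."""
    width = len(image[0])
    expected = flip_keypoint(estimator(image), width)
    actual = estimator(flip_horizontal(image))
    error = max(abs(expected[0] - actual[0]), abs(expected[1] - actual[1]))
    return error > tolerance


def brightest_pixel(image: Image, max_col: Optional[int] = None) -> Keypoint:
    """Toy 'estimator': locate the brightest pixel, optionally looking only at
    the first `max_col` columns (used below to simulate a faulty system)."""
    width = len(image[0]) if max_col is None else max_col
    best = (0, 0)
    for y, row in enumerate(image):
        for x in range(width):
            if row[x] > image[best[1]][best[0]]:
                best = (x, y)
    return best


def left_half_only(image: Image) -> Keypoint:
    """Faulty toy estimator that ignores the right half of the image."""
    return brightest_pixel(image, max_col=len(image[0]) // 2)


image = [
    [0, 0, 0, 0],
    [0, 0, 0, 9],
    [0, 0, 0, 0],
]
print(violates_flip_rule(brightest_pixel, image))  # correct estimator
print(violates_flip_rule(left_half_only, image))   # biased estimator
```

The check compares two outputs of the same system against each other, which is what lets it sidestep the oracle problem described in the abstract.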

83.0 · CY · Apr 8
Playing Games with My Heart: An Evaluation of AI Companion Apps

Maribeth Rauh, Dick A. H. Blankvoort, Matias Duran et al.

The use of chatbots for various forms of companionship is growing rapidly, raising a myriad of questions about simulated relationships, emotional dependence, and psychological harm. While major platforms such as ChatGPT, Grok, and Character.AI are the subject of a growing body of research and legal inquiries, apps explicitly built for simulating intimate interpersonal relationships remain under-explored. In this work, we evaluate the five most popular AI companion mobile applications in the EU and UK markets for factors that encourage parasocial interaction and may manipulate users. We do this by manually annotating the user experience each offers. Specifically, we systematically record and quantify design dark patterns, anthropomorphism, stereotypes, erotica, and technical performance issues. We find that all apps contain substantial dark patterns aimed at increasing monetisation and user engagement. Erotica and gamification features such as levelling are also prevalent, and although other features vary considerably between applications, all apps have highly anthropomorphic design. These findings shed light on the mechanics used to leverage users' simulated relationships. On that basis, we put forward concrete recommendations for regulators to strengthen consumer protection in this rapidly emerging market. Content warning: This article contains objectifying images of women, erotic images, textual references to incest, and other potentially sensitive, offensive, and distressing text.

AI · Mar 18, 2021
MILP for the Multi-objective VM Reassignment Problem

Takfarinas Saber, Anthony Ventresque, Joao Marques-Silva et al.

Machine Reassignment is a challenging problem for constraint programming (CP) and mixed-integer linear programming (MILP) approaches, especially given the size of data centres. The multi-objective version of the Machine Reassignment Problem is even more challenging and it seems unlikely for CP or MILP to obtain good results in this context. As a result, the first approaches to address this problem have been based on other optimisation methods, including metaheuristics. In this paper we study under which conditions a mixed-integer optimisation solver, such as IBM ILOG CPLEX, can be used for the Multi-objective Machine Reassignment Problem. We show that it is useful only for small or medium-scale data centres and with some relaxations, such as an optimality tolerance gap and a limited number of directions explored in the search space. Building on this study, we also investigate a hybrid approach, feeding a metaheuristic with the results of CPLEX, and we show that the gains are important in terms of quality of the set of Pareto solutions (+126.9% against the metaheuristic alone and +17.8% against CPLEX alone) and number of solutions (8.9 times more than CPLEX), while the processing time increases only by 6% in comparison to CPLEX for execution times larger than 100 seconds.

SE · Oct 2, 2019
A Mutation-based Approach for Assessing Weight Coverage of a Path Planner

Thomas Laurent, Paolo Arcaini, Fuyuki Ishikawa et al.

Autonomous cars are subjected to several different kinds of inputs (other cars, road structure, etc.) and, therefore, testing the car under all possible conditions is impossible. To tackle this problem, scenario-based testing for automated driving defines categories of different scenarios that should be covered. Although this kind of coverage is a necessary condition, it still does not guarantee that every possible behaviour of the autonomous car is tested. In this paper, we consider the path planner of an autonomous car that decides, at each timestep, the short-term path to follow in the next few seconds; this decision is made using a weighted cost function that considers different aspects (safety, comfort, etc.). In order to assess whether all the possible decisions that can be taken by the path planner are covered by a given test suite T, we propose a mutation-based approach that mutates the weights of the cost function and then checks whether at least one scenario of T kills the mutant. Preliminary experiments on a manually designed test suite show that some weights are easier to cover, as they consider aspects that are more likely to occur in a scenario, and that more complicated scenarios (which generate more complex paths) are those that cover more weights.
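
The weight-coverage idea can be sketched as follows. This is an illustrative simplification and not the paper's implementation. It assumes a planner that picks the candidate path with the lowest weighted cost, a mutation that multiplies one weight by a fixed factor, and a mutant counted as killed when the chosen path changes in at least one scenario. All names and values are hypothetical.

```python
from typing import Dict, List, Set

Path = Dict[str, float]   # aspect name -> raw cost of this candidate path
Scenario = List[Path]     # candidate paths the planner can choose from


def choose_path(weights: Dict[str, float], scenario: Scenario) -> int:
    """Return the index of the candidate path with the lowest weighted cost."""
    def cost(path: Path) -> float:
        return sum(w * path.get(aspect, 0.0) for aspect, w in weights.items())
    return min(range(len(scenario)), key=lambda i: cost(scenario[i]))


def covered_weights(
    weights: Dict[str, float],
    suite: List[Scenario],
    factor: float = 0.0,  # assumed mutation: scale one weight by this factor
) -> Set[str]:
    """Return the weights whose mutant is killed by at least one scenario."""
    covered = set()
    for aspect in weights:
        mutant = dict(weights)
        mutant[aspect] = weights[aspect] * factor
        if any(choose_path(weights, s) != choose_path(mutant, s) for s in suite):
            covered.add(aspect)
    return covered


weights = {"safety": 10.0, "comfort": 1.0}
suite = [
    [
        {"safety": 0.0, "comfort": 5.0},  # safe but uncomfortable path
        {"safety": 1.0, "comfort": 0.0},  # comfortable but less safe path
    ],
]
print(covered_weights(weights, suite))
```

In this toy suite, zeroing the safety weight changes the chosen path while zeroing the comfort weight does not, so only the safety weight is covered. A real study would likely compare planned trajectories with a distance threshold instead of exact equality.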

SE · Jun 7, 2019
Learning Software Configuration Spaces: A Systematic Literature Review

Juliana Alves Pereira, Hugo Martin, Mathieu Acher et al.

Most modern software systems (operating systems like Linux or Android, Web browsers like Firefox or Chrome, video encoders like ffmpeg, x264 or VLC, mobile and cloud applications, etc.) are highly configurable. Hundreds of configuration options, features, or plugins can be combined, each potentially with distinct functionality and effects on execution time, security, energy consumption, etc. Due to the combinatorial explosion and the cost of executing software, it quickly becomes impossible to exhaustively explore the whole configuration space. Hence, numerous works have investigated the idea of learning it from a small sample of configurations' measurements. The pattern "sampling, measuring, learning" has emerged in the literature and is of practical interest to both software developers and end-users of configurable systems. In this survey, we report on the different application objectives (e.g., performance prediction, configuration optimization, constraint mining), use-cases, targeted software systems and application domains. We review the various strategies employed to gather a representative and cost-effective sample. We describe automated software techniques used to measure functional and non-functional properties of configurations. We classify machine learning algorithms and how they relate to the pursued application. Finally, we also describe how researchers evaluate the quality of the learning process. The findings from this systematic review show that the potential application objective is important; there are a vast number of case studies reported in the literature, drawn from several domains and software systems. Yet, the huge variant space of configurable systems is still challenging and calls for further investigation of the synergies between artificial intelligence and software engineering.

SE · Jan 11, 2016
Assessing and Improving the Mutation Testing Practice of PIT

Thomas Laurent, Anthony Ventresque, Mike Papadakis et al.

Mutation testing is used extensively to support experimentation in software engineering studies. Its application to real-world projects is possible thanks to modern tools that automate the whole mutation analysis process. However, popular mutation testing tools use a restricted set of mutants that does not conform to the community standards as supported by the mutation testing literature. This can be problematic since the effectiveness of mutation depends on its mutants. We therefore examine how effective the mutants of a popular mutation testing tool, PIT, are compared to comprehensive ones, as drawn from the literature and personal experience. We show that comprehensive mutants are harder to kill and encode faults not captured by the mutants of PIT for a range of 11% to 62% of the Java classes of the considered projects.