CV · Mar 10
LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos
Lei Shi, Victor Aregbede, Andreas Persson et al.
Procedure planning requires a model to predict a sequence of actions that transform a start visual observation into a goal in instructional videos. Most existing methods rely primarily on visual observations as input, and they often struggle with the inherent ambiguity that arises when different actions appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a fine-tuned Vision Language Model (VLM) to translate visual observations into text descriptions, predict actions, and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, COIN, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by a large margin, demonstrating the significant advantage of language-aware planning.
AI · Jan 29
Abstract Concept Modelling in Conceptual Spaces: A Study on Chess Strategies
Hadi Banaee, Stephanie Lowry
We present a conceptual space framework for modelling abstract concepts that unfold over time, demonstrated through a chess-based proof-of-concept. Strategy concepts, such as attack or sacrifice, are represented as geometric regions across interpretable quality dimensions, with chess games instantiated and analysed as trajectories whose directional movement toward regions enables recognition of intended strategies. This approach also supports dual-perspective modelling, capturing how players interpret identical situations differently. Our implementation demonstrates the feasibility of trajectory-based concept recognition, with movement patterns aligning with expert commentary. This work explores extending the conceptual spaces theory to temporally realised, goal-directed concepts. The approach establishes a foundation for broader applications involving sequential decision-making and supports integration with knowledge evolution mechanisms for learning and refining abstract concepts over time.
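The idea of recognising a strategy from a trajectory's directional movement toward a region can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: each region is reduced to a single centre point, the two quality dimensions are hypothetical, and the score (mean cosine between each step and the direction to the centre) is one simple choice among many.

```python
import math


def approach_score(trajectory, centre):
    """Mean cosine between each step and the direction from the step's start to `centre`.

    Values near 1 mean the trajectory heads straight for the centre;
    values near -1 mean it moves directly away from it.
    """
    scores = []
    for start, end in zip(trajectory, trajectory[1:]):
        step = [e - s for s, e in zip(start, end)]
        to_centre = [c - s for s, c in zip(start, centre)]
        step_len = math.hypot(*step)
        dist = math.hypot(*to_centre)
        if step_len == 0 or dist == 0:
            continue  # no movement, or already at the centre
        dot = sum(a * b for a, b in zip(step, to_centre))
        scores.append(dot / (step_len * dist))
    return sum(scores) / len(scores) if scores else 0.0


def recognise(trajectory, region_centres):
    """Return the name of the region the trajectory most consistently moves toward."""
    return max(region_centres, key=lambda name: approach_score(trajectory, region_centres[name]))


if __name__ == "__main__":
    # Hypothetical 2-D space: (material balance, pressure on the opposing king).
    centres = {"attack": (0.0, 5.0), "sacrifice": (-5.0, 0.0)}
    game = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
    print(recognise(game, centres))
```

Scoring the same game against a second set of centres would be one way to mimic the dual-perspective modelling mentioned above, with each player holding their own region definitions.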
CV · Sep 25, 2025
Learning Conformal Explainers for Image Classifiers
Amr Alkhatib, Stephanie Lowry
Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model's prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model's predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
RO · Apr 19, 2020
Robust Frequency-Based Structure Extraction
Tomasz Piotr Kucner, Matteo Luperto, Stephanie Lowry et al.
State-of-the-art mapping algorithms can produce high-quality maps. However, they remain vulnerable to clutter and outliers, which can degrade map quality and consequently hinder both robot performance and further map processing for semantic understanding of the environment. This paper presents ROSE, a method for building-level structure detection in robotic maps. ROSE exploits the fact that indoor environments usually contain walls and straight-line elements along a limited set of orientations. Therefore, metric maps often have a set of dominant directions. ROSE extracts these directions and uses this information to segment the map into structure and clutter by filtering the map in the frequency domain (an approach substantially underutilised in mapping applications). Removing the clutter in this way makes wall detection (e.g., using the Hough transform) more robust. Our experiments demonstrate that (1) applying ROSE for decluttering can substantially improve the retrieval of structural features (e.g., walls) in cluttered environments, (2) ROSE can successfully distinguish between clutter and structure in the map even with a substantial amount of noise, and (3) ROSE can numerically assess the amount of structure in the map.
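The general idea of finding dominant directions in the frequency domain and filtering along them can be sketched with NumPy. This is a simplified illustration, not the ROSE algorithm itself: the function names, the 1-degree angle bins, the top-k peak picking, the angular tolerance, the 0.5 threshold, and the synthetic map are all assumptions made for this example.

```python
import numpy as np


def _frequency_angles(shape):
    """Angle (degrees, folded into [0, 180)) and radius of every FFT cell."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    angles = np.degrees(np.arctan2(fy, fx)) % 180.0
    return angles, np.hypot(fx, fy)


def dominant_angles(grid, n_peaks=2):
    """Frequency-domain angles (degrees) carrying the most spectral energy.

    A straight wall concentrates its energy on a line through the origin of
    the spectrum, perpendicular to the wall's direction in the map.
    """
    angles, radius = _frequency_angles(grid.shape)
    magnitude = np.abs(np.fft.fft2(grid))
    bins = np.round(angles).astype(int) % 180  # 1-degree bins
    nonzero = radius > 0                       # ignore the DC term
    energy = np.bincount(bins[nonzero], weights=magnitude[nonzero], minlength=180)
    return sorted(float(a) for a in np.argsort(energy)[-n_peaks:])


def structure_mask(grid, angles_deg, tol_deg=1.0, threshold=0.5):
    """Keep only spectrum cells near the dominant angles, then threshold the result."""
    angles, radius = _frequency_angles(grid.shape)
    keep = radius == 0  # always keep the DC term
    for a in angles_deg:
        d = np.abs(angles - a)
        keep |= np.minimum(d, 180.0 - d) <= tol_deg
    filtered = np.real(np.fft.ifft2(np.fft.fft2(grid) * keep))
    return filtered > threshold


def make_demo_map(size=64, seed=0):
    """Synthetic occupancy grid: four axis-aligned walls plus sparse random clutter."""
    rng = np.random.default_rng(seed)
    grid = np.zeros((size, size))
    grid[[10, 50], :] = 1.0
    grid[:, [5, 40]] = 1.0
    walls = grid > 0
    clutter = (rng.random((size, size)) < 0.01) & ~walls
    grid[clutter] = 1.0
    return grid, walls, clutter


if __name__ == "__main__":
    grid, walls, clutter = make_demo_map()
    angles = dominant_angles(grid)
    mask = structure_mask(grid, angles)
    print("dominant angles:", angles)
    print("wall cells kept:", int(mask[walls].sum()), "of", int(walls.sum()))
    print("clutter cells kept:", int(mask[clutter].sum()), "of", int(clutter.sum()))
```

On this synthetic map the walls span the full grid, so the filter separates them cleanly. Real maps with partial or slightly rotated walls would likely need a wider tolerance and a more careful peak detector than the top-k selection used here.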