Walterio Mayol-Cuevas

CV
h-index: 26
Papers: 29
Citations: 495
Novelty: 45%
AI Score: 42

29 Papers

CV · Oct 21, 2022
AROS: Affordance Recognition with One-Shot Human Stances

Abel Pacheco-Ortega, Walterio Mayol-Cuevas

We present AROS, a one-shot learning approach that uses an explicit representation of interactions between highly-articulated human poses and 3D scenes. The approach is one-shot as the method does not require re-training to add new affordance instances. Furthermore, only one or a small handful of examples of the target pose are needed to describe the interaction. Given a 3D mesh of a previously unseen scene, we can predict affordance locations that support the interactions and generate corresponding articulated 3D human bodies around them. We evaluate on three public datasets of scans of real environments with varied degrees of noise. Via rigorous statistical analysis of crowdsourced evaluations, results show that our one-shot approach outperforms data-intensive baselines by up to 80%.

CV · Dec 10, 2025
From Detection to Anticipation: Online Understanding of Struggles across Various Tasks and Activities

Shijia Feng, Michael Wray, Walterio Mayol-Cuevas

Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70-80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4-20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 20 FPS, sufficient for real-time assistive applications.

CV · Jul 30, 2024
Re-localization acceleration with Medoid Silhouette Clustering

Hongyi Zhang, Walterio Mayol-Cuevas

Two crucial performance criteria for the deployment of visual localization are speed and accuracy. Current research on visual localization with neural networks is limited to examining methods for enhancing the accuracy of networks across various datasets. How to expedite the re-localization process within deep neural network architectures still needs further investigation. In this paper, we present a novel approach for accelerating visual re-localization in practice. A tree-like search strategy, built on the keyframes extracted by a visual clustering algorithm, is designed for matching acceleration. Our method has been validated on two tasks across three public datasets, saving 50 to 90 percent of the time relative to the baseline without reducing localization accuracy.
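
The tree-like search over clustered keyframes can be illustrated with a toy two-level lookup. This is only a minimal sketch under stated assumptions, not the authors' method: keyframes are stood in for by 2D descriptor tuples, the clusters are given by hand rather than produced by a clustering algorithm, and the names `medoid`, `build_index` and `relocalize` are hypothetical.

```python
from math import dist

def medoid(points):
    """Return the member with the smallest total distance to all members."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

def build_index(clusters):
    """Pair each cluster with its medoid (the first level of the search tree)."""
    return [(medoid(members), members) for members in clusters]

def relocalize(query, index):
    """Two-level search: pick the nearest medoid, then the nearest keyframe
    inside that cluster only, instead of scanning every keyframe."""
    _, members = min(index, key=lambda entry: dist(query, entry[0]))
    return min(members, key=lambda p: dist(query, p))

# Hypothetical keyframe descriptors, already grouped into two clusters.
clusters = [
    [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)],
    [(5.0, 5.0), (5.2, 4.9), (4.9, 5.3)],
]
index = build_index(clusters)
match = relocalize((5.1, 5.0), index)
```

With k clusters of roughly n/k keyframes each, a query is compared against about k + n/k descriptors rather than n, which is the kind of saving a hierarchical search is meant to provide.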

CV · Nov 22, 2022
SuperTran: Reference Based Video Transformer for Enhancing Low Bitrate Streams in Real Time

Tejas Khot, Nataliya Shapovalova, Silviu Andrei et al.

This work focuses on low bitrate video streaming scenarios (e.g. 50-200 Kbps) where the video quality is severely compromised. We present a family of novel deep generative models for enhancing perceptual video quality of such streams by performing super-resolution while also removing compression artifacts. Our model, which we call SuperTran, consumes as input a single high-quality, high-resolution reference image in addition to the low-quality, low-resolution video stream. The model thus learns how to borrow or copy visual elements like textures from the reference image and fill in the remaining details from the low resolution stream in order to produce perceptually enhanced output video. The reference frame can be sent once at the start of the video session or be retrieved from a gallery. Importantly, the resulting output has substantially better detail than what has been otherwise possible with methods that only use a low resolution input such as the SuperVEGAN method. SuperTran works in real-time (up to 30 frames/sec) on the cloud alongside standard pipelines.

CV · Oct 1, 2025
EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels

Shijia Feng, Michael Wray, Walterio Mayol-Cuevas

The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user's current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities -- tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.

CV · Jan 12, 2025
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding

Wenqi Zhou, Kai Cao, Hao Zheng et al.

Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short (e.g., minutes to tens of minutes) to moderately long videos, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset meticulously designed to fill this gap by focusing on tasks requiring a comprehensive understanding of extremely long egocentric video recordings. Our X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D, a massive-scale egocentric video dataset covering a wide range of daily life scenarios, resulting in 432 simulated video life logs spanning from 23 minutes to 16.4 hours. The evaluations of several baseline systems and multimodal large language models (MLLMs) reveal their poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding, such as temporal localization and reasoning, context aggregation, and memory retention, and underscoring the need for more advanced models.

CV · Feb 16, 2024
Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos

Shijia Feng, Michael Wray, Brian Sullivan et al.

Determining when people are struggling allows for a finer-grained understanding of actions that complements conventional action classification and error detection. Struggle detection, as defined in this paper, is a distinct and important task that can be identified without explicit step or activity knowledge. We introduce the first struggle dataset with three real-world problem-solving activities that are labelled by both expert and crowd-source annotators. Video segments were scored w.r.t. their level of struggle using a forced choice 4-point scale. This dataset contains 5.1 hours of video from 73 participants. We conducted a series of experiments to identify the most suitable modelling approaches for struggle determination. Additionally, we compared various deep learning models, establishing baseline results for struggle classification, struggle regression, and struggle label distribution learning. Our results indicate that struggle detection in video can achieve up to 88.24% accuracy in binary classification, while detecting the level of struggle in a four-way classification setting performs lower, with an overall accuracy of 52.45%. Our work is motivated toward a more comprehensive understanding of action in video and potentially the improvement of assistive systems that analyse struggle and can better support users during manual activities.

CV · Feb 2, 2022
On-Sensor Binarized Fully Convolutional Neural Network with A Pixel Processor Array

Yanan Liu, Laurie Bose, Yao Lu et al.

This work presents a method to implement fully convolutional neural networks (FCNs) on Pixel Processor Array (PPA) sensors, and demonstrates coarse segmentation and object localisation tasks. We design and train a binarized FCN with both binary weights and activations using batchnorm, group convolution, and a learnable threshold for binarization, producing networks small enough to be embedded on the focal plane of the PPA, with limited local memory resources, and using parallel elementary add/subtract, shifting, and bit operations only. We demonstrate the first implementation of an FCN on a PPA device, performing three convolution layers entirely in the pixel-level processors. We use this architecture to demonstrate inference generating heat maps for object segmentation and localisation at over 280 FPS using the SCAMP-5 PPA vision chip.

CV · May 26, 2021
Direct Servo Control from In-Sensor CNN Inference with A Pixel Processor Array

Yanan Liu, Jianing Chen, Laurie Bose et al.

This work demonstrates direct visual sensory-motor control using high-speed CNN inference via a SCAMP-5 Pixel Processor Array (PPA). We demonstrate how PPAs are able to efficiently bridge the gap between perception and action. A binary Convolutional Neural Network (CNN) is used for a classic rock, paper, scissors classification problem at over 8000 FPS. Control instructions are directly sent to a servo motor from the PPA according to the CNN's classification result without any other intermediate hardware.

RO · May 21, 2021
Bringing A Robot Simulator to the SCAMP Vision System

Yanan Liu, Jianing Chen, Laurie Bose et al.

This work develops and demonstrates the integration of the SCAMP-5d vision system into the CoppeliaSim robot simulator, creating a semi-simulated environment. By configuring a camera in the simulator and setting up communication with the SCAMP Python host through the remote API, sensor images from the simulator can be transferred to the SCAMP vision sensor, where on-sensor image processing such as CNN inference can be performed. SCAMP output is then fed back into CoppeliaSim. This proposed platform integration enables rapid prototyping and validation of SCAMP algorithms for robotic systems. We demonstrate a car localisation and tracking task using this proposed semi-simulated platform, with CNN inference on SCAMP to command the motion of a robot. We have made this platform available online.

CV · Apr 28, 2021
Filter Distribution Templates in Convolutional Networks for Image Classification Tasks

Ramon Izquierdo-Cordova, Walterio Mayol-Cuevas

Neural network designers have progressively improved accuracy by increasing model depth, introducing new layer types and discovering new combinations of layers. A common element in many architectures is the distribution of the number of filters in each layer. Neural network models keep a design pattern of increasing filters in deeper layers, such as those in LeNet, VGG, ResNet, MobileNet and even in automatically discovered architectures such as NASNet. It remains unknown if this pyramidal distribution of filters is the best for different tasks and constraints. In this work we present a series of modifications in the distribution of filters in four popular neural network models and their effects on accuracy and resource consumption. Results show that by applying this approach, some models improve up to 8.9% in accuracy while showing reductions in parameters of up to 54%.

CV · Apr 17, 2021
Towards Efficient Convolutional Network Models with Filter Distribution Templates

Ramon Izquierdo-Cordova, Walterio Mayol-Cuevas

Increasing the number of filters in deeper layers as feature maps decrease in size is a widely adopted pattern in convolutional network design. It can be found in classical CNN architectures and in automatically discovered models. Even CNS methods commonly explore a selection of multipliers derived from this pyramidal pattern. We challenge this practice by introducing a small set of templates consisting of easy-to-implement, intuitive and aggressive variations of the original pyramidal distribution of filters in VGG and ResNet architectures. Experiments on CIFAR, CINIC10 and TinyImagenet datasets show that models produced by our templates are more efficient in terms of fewer parameters and memory needs.
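
To make the idea of redistributing filters concrete, the sketch below rearranges a pyramidal per-layer filter list and counts the resulting convolution parameters. It is an illustration only: the template names (`"reverse"`, `"uniform"`), the four-layer filter list and the helper functions are assumptions for this example, not the templates defined by the authors.

```python
def apply_template(base_filters, template):
    """Redistribute per-layer filter counts according to a named template."""
    if template == "base":
        return list(base_filters)
    if template == "reverse":
        # Most filters in the early layers instead of the deep ones.
        return list(reversed(base_filters))
    if template == "uniform":
        # Same (average) number of filters in every layer.
        avg = round(sum(base_filters) / len(base_filters))
        return [avg] * len(base_filters)
    raise ValueError(f"unknown template: {template}")

def conv_params(filters, in_channels=3, kernel=3):
    """Weights plus biases of a plain stack of kernel x kernel conv layers."""
    total, prev = 0, in_channels
    for f in filters:
        total += prev * f * kernel * kernel + f
        prev = f
    return total

base = [64, 128, 256, 512]  # hypothetical pyramidal distribution
for name in ("base", "reverse", "uniform"):
    filters = apply_template(base, name)
    print(name, filters, conv_params(filters))
```

Parameter count is only one axis: because early layers operate on larger feature maps, moving filters towards them changes memory and compute very differently from what the parameter total alone suggests, so any real comparison would also need to measure those.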

RO · Sep 27, 2020
Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array

Yanan Liu, Laurie Bose, Colin Greatwood et al.

This paper presents an agile reactive navigation strategy for driving a non-holonomic ground vehicle around a preset course of gates in a cluttered environment using a low-cost processor array sensor. This enables machine vision tasks to be performed directly upon the sensor's image plane, rather than using a separate general-purpose computer. We demonstrate a small ground vehicle running through or avoiding multiple gates at high speed using minimal computational resources. To achieve this, target tracking algorithms are developed for the Pixel Processor Array, and captured images are then processed directly on the vision sensor to acquire target information for controlling the ground vehicle. The algorithm can run at up to 2000 fps outdoors and 200 fps at indoor illumination levels. Conducting image processing at the sensor level avoids the bottleneck of image transfer encountered in conventional sensors. The real-time performance and robustness of on-board image processing are validated through experiments. Experimental results demonstrate the algorithm's ability to enable a ground vehicle to navigate at an average speed of 2.20 m/s when passing through multiple gates and 3.88 m/s for a 'slalom' task in an environment featuring significant visual clutter.

CV · Apr 27, 2020
Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays

Laurie Bose, Jianing Chen, Stephen J. Carey et al.

We present a novel method of CNN inference for pixel processor array (PPA) vision sensors, designed to take advantage of their massive parallelism and analog compute capabilities. PPA sensors consist of an array of processing elements (PEs), with each PE capable of light capture, data storage and computation, allowing various computer vision processing to be executed directly upon the sensor device. The key idea behind our approach is storing network weights "in-pixel" within the PEs of the PPA sensor itself to allow various computations, such as multiple different image convolutions, to be carried out in parallel. Our approach can perform convolutional layers, max pooling, ReLU, and a final fully connected layer entirely upon the PPA sensor, while leaving no untapped computational resources. This is in contrast to previous works that use sensor-level processing only to sequentially compute image convolutions, and must transfer data to an external digital processor to complete the computation. We demonstrate our approach on the SCAMP-5 vision system, performing inference of an MNIST digit classification network at over 3000 frames per second and over 93% classification accuracy. This is the first work demonstrating CNN inference conducted entirely upon the processor array of a PPA vision sensor device, requiring no external processing.

CV · Dec 13, 2019
Action Modifiers: Learning from Adverbs in Instructional Videos

Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas et al.

We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of the adverb is highly dependent on the action to which it applies, although the same adverb will modify multiple actions in a similar way. For instance, while 'spread quickly' and 'mix quickly' will look dissimilar, we can learn a common representation that allows us to recognize both, among other actions. We formulate this as an embedding problem, and use scaled dot-product attention to learn from weakly-supervised video narrations. We jointly learn adverbs as invertible transformations operating on the embedding space, so as to add or remove the effect of the adverb. As there is no prior work on weakly supervised learning from adverbs, we gather paired action-adverb annotations from a subset of the HowTo100M dataset for 6 adverbs: quickly/slowly, finely/coarsely, and partially/completely. Our method outperforms all baselines for video-to-adverb retrieval with a performance of 0.719 mAP. We also demonstrate our model's ability to attend to the relevant video parts in order to determine the adverb for a given action.
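
Scaled dot-product attention, which the abstract names, computes weights softmax(q·k / sqrt(d)) over a set of keys and returns the weighted sum of the values. The sketch below shows that general operation only; how the paper builds its queries, keys and values from actions and video segments is not stated here, so the toy vectors and function names are assumptions.

```python
from math import exp, sqrt

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def scaled_dot_product_attention(query, keys, values):
    """Return (output, weights) for a single query vector.

    weights[i] = softmax_i(query . keys[i] / sqrt(d)); output = sum_i weights[i] * values[i].
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Hypothetical example: one query (e.g. an action embedding) attending over
# three segment features; the second key matches the query best.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = scaled_dot_product_attention(query, keys, values)
```

The weights indicate which inputs contributed most to the output, which is the property that lets an attention-based model point at the relevant parts of a video.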

CV · Sep 12, 2019
A Camera That CNNs: Towards Embedded Neural Networks on Pixel Processor Arrays

Laurie Bose, Jianing Chen, Stephen J. Carey et al.

We present a convolutional neural network implementation for pixel processor array (PPA) sensors. PPA hardware consists of a fine-grained array of general-purpose processing elements, each capable of light capture, data storage, program execution, and communication with neighboring elements. This allows images to be stored and manipulated directly at the point of light capture, rather than having to transfer images to external processing hardware. Our CNN approach divides this array up into 4x4 blocks of processing elements, essentially trading off image resolution for increased local memory capacity per 4x4 "pixel". We implement parallel operations for image addition, subtraction and bit-shifting in this 4x4 block format. Using these components we formulate how to perform ternary weight convolutions upon these images, compactly store results of such convolutions, perform max-pooling, and transfer the resulting sub-sampled data to an attached micro-controller. We train ternary weight filter CNNs for digit recognition and a simple tracking task, and demonstrate inference of these networks upon the SCAMP5 PPA system. This work represents a first step towards embedding neural network processing capability directly onto the focal plane of a sensor.
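
The reason ternary weights suit hardware limited to addition, subtraction and shifts is that a weight in {-1, 0, +1} never requires a multiplication. The following is a minimal sequential sketch of that arithmetic, assuming plain nested lists for images; it does not model the 4x4 block storage or the pixel-parallel execution described in the abstract, and the function names are hypothetical.

```python
def ternary_conv2d(image, kernel):
    """'Valid' 2D correlation with weights in {-1, 0, +1}, using only add/subtract."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    w = kernel[i][j]
                    if w == 1:
                        acc += image[y + i][x + j]
                    elif w == -1:
                        acc -= image[y + i][x + j]
                    # w == 0: the pixel is skipped entirely
            out[y][x] = acc
    return out

def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (assumes even height and width)."""
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, len(fmap[0]), 2)]
        for y in range(0, len(fmap), 2)
    ]

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]  # horizontal gradient
features = ternary_conv2d(image, edge_kernel)
pooled = max_pool2x2(features)
```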

CV · Jun 13, 2019
Egocentric affordance detection with the one-shot geometry-driven Interaction Tensor

Eduardo Ruiz, Walterio Mayol-Cuevas

In this abstract we describe recent [4,7] and latest work on the determination of affordances in visually perceived 3D scenes. Our method builds on the hypothesis that geometry on its own provides enough information to enable the detection of significant interaction possibilities in the environment. The motivation behind this is that geometric information is intimately related to the physical interactions afforded by objects in the world. The approach uses a generic representation for the interaction between everyday objects such as a mug or an umbrella with the environment, and also for more complex affordances such as humans Sitting or Riding a motorcycle. Experiments with synthetic and real RGB-D scenes show that the representation enables the prediction of affordance candidate locations in novel environments at fast rates and from a single (one-shot) training example. The determination of affordances is a crucial step towards systems that need to perceive and interact with their surroundings. We here illustrate output on two cases for a simulated robot and for an Augmented Reality setting, both perceiving in an egocentric manner.

CV · Dec 13, 2018
The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos

Hazel Doughty, Walterio Mayol-Cuevas, Dima Damen

We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. However, for long videos, parts of the video are irrelevant for assessing skill, and there may be variability in the skill exhibited throughout a video. We therefore propose a method which assesses the relative overall level of skill in a long video by attending to its skill-relevant parts. Our approach trains temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill. We evaluate our approach on the EPIC-Skills dataset and additionally annotate a larger dataset from YouTube videos for skill determination with five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4% pairwise accuracy, and as much as 12% on individual tasks. We also demonstrate our model's ability to attend to rank-aware parts of the video.

CV · Dec 3, 2018
What can I do here? Leveraging Deep 3D saliency and geometry for fast and scalable multiple affordance detection

Eduardo Ruiz, Walterio Mayol-Cuevas

This paper develops and evaluates a novel method that allows for the detection of affordances in a scalable and multiple-instance manner on visually recovered point clouds. Our approach has many advantages over alternative methods, as it is based on highly parallelizable, one-shot learning that is fast on commodity hardware. The approach is hybrid in that it uses a geometric representation together with a state-of-the-art deep learning method capable of identifying 3D scene saliency. The geometric component allows for a compact and efficient representation, boosting the performance of the deep network architecture which proved insufficient on its own. Moreover, our approach predicts not only whether an input scene affords the interactions, but also the pose of the objects that allow these interactions to take place. Our predictions align well with crowd-sourced human judgment, being preferred with 87% probability; they perform almost four times (4x) better than a deep learning-only baseline and are seven times (7x) faster than previous art.

CV · Sep 15, 2017
Towards CNN map representation and compression for camera relocalisation

Luis Contreras, Walterio Mayol-Cuevas

This paper presents a study on the use of Convolutional Neural Networks for camera relocalisation and its application to map compression. We follow state of the art visual relocalisation results and evaluate the response to different data inputs. We use a CNN map representation and introduce the notion of map compression under this paradigm by using smaller CNN architectures without sacrificing relocalisation performance. We evaluate this approach in a series of publicly available datasets over a number of CNN architectures with different sizes, both in complexity and number of layers. This formulation allows us to improve relocalisation accuracy by increasing the number of training trajectories while maintaining a constant-size CNN.

CV · Mar 30, 2017
Geometric Affordances from a Single Example via the Interaction Tensor

Eduardo Ruiz, Walterio Mayol-Cuevas

This paper develops and evaluates a new tensor field representation to express the geometric affordance of one object over another. We expand the well known bisector surface representation to one that is weight-driven and that retains the provenance of surface points with directional vectors. We also incorporate the notion of affordance keypoints which allow for faster decisions at a point of query and with a compact and straightforward descriptor. Using a single interaction example, we are able to generalize to previously-unseen scenarios, both synthetic and real scenes captured with RGBD sensors. We show how our interaction tensor allows for significantly better performance over alternative formulations. Evaluations also include crowdsourcing comparisons that confirm the validity of our affordance proposals, which agree on average 84% of the time with human judgments and are 20-40% better than the baseline methods.

CV · Mar 29, 2017
Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination

Hazel Doughty, Dima Damen, Walterio Mayol-Cuevas

We present a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate the problem as pairwise (who's better?) and overall (who's best?) ranking of video collections, using supervised deep ranking. We propose a novel loss function that learns discriminative features when a pair of videos exhibit variance in skill, and learns shared features when a pair of videos exhibit comparable skill levels. Results demonstrate our method is applicable across tasks, with the percentage of correctly ordered pairs of videos ranging from 70% to 83% for four datasets. We demonstrate the robustness of our approach via sensitivity analysis of its parameters. We see this work as effort toward the automated organization of how-to video collections and overall, generic skill determination in video.
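
The two behaviours described for the loss (separating scores when one video shows more skill, keeping them close when skill is comparable) can be sketched with two hinge terms, together with the "percentage of correctly ordered pairs" metric. This is a generic margin-based illustration and not the paper's exact formulation: the function names, the margin value and the scalar scores are assumptions.

```python
def ranking_loss(score_better, score_worse, margin=1.0):
    """Hinge loss that is zero once the better video outscores the worse by `margin`."""
    return max(0.0, margin - (score_better - score_worse))

def similarity_loss(score_a, score_b, margin=1.0):
    """Hinge loss that is zero while two comparable videos score within `margin`."""
    return max(0.0, abs(score_a - score_b) - margin)

def pairwise_accuracy(scores, ordered_pairs):
    """Fraction of (better, worse) pairs whose predicted scores agree with the label."""
    correct = sum(1 for better, worse in ordered_pairs if scores[better] > scores[worse])
    return correct / len(ordered_pairs)

# Hypothetical predicted skill scores for three videos.
scores = {"video_a": 3.0, "video_b": 2.0, "video_c": 1.0}
pairs = [("video_a", "video_b"), ("video_b", "video_c"), ("video_c", "video_a")]
accuracy = pairwise_accuracy(scores, pairs)
```

In training, the scores would come from a network applied to each video of a pair, with the ranking term used for pairs labelled as different in skill and the similarity term for pairs labelled as comparable.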

CV · Mar 27, 2017
Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

Davide Moltisanti, Michael Wray, Walterio Mayol-Cuevas et al.

Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to changes in labeled temporal bounds, for object interaction recognition. As boundaries are trespassed, a drop of up to 10% is observed for both Improved Dense Trajectories and Two-Stream Convolutional Neural Network. We demonstrate that such disagreement stems from a limited understanding of the distinct phases of an action, and propose annotating based on the Rubicon Boundaries, inspired by a similarly named cognitive model, for consistent temporal bounds of object interactions. Evaluated on a public dataset, we report a 4% increase in overall accuracy, and an increase in accuracy for 55% of classes when Rubicon Boundaries are used for temporal annotations.
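
One common way to quantify disagreement between two annotated temporal bounds is the temporal intersection-over-union of their (start, end) intervals. The sketch below shows that measure as an illustration; the abstract does not say which agreement measure the authors used, so treating IoU as the measure, and the example annotations, are assumptions.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals given in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical bounds from two annotators labelling the same interaction.
annotator_1 = (0.0, 4.0)
annotator_2 = (2.0, 6.0)
agreement = temporal_iou(annotator_1, annotator_2)
```

A value of 1.0 means the annotators chose identical bounds and 0.0 means their segments do not overlap at all.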

CV · Mar 24, 2017
Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition

Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas et al.

This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising sub-interactions along with concurrent interactions result in legitimate class overlaps (Figure 1). We thus aim to model the mapping between observations and interaction classes, as well as class overlaps, towards a probabilistic multi-label classifier that emulates human annotators. Given a video segment containing an object interaction, we model the probability for a verb, out of a list of possible verbs, to be used to annotate that interaction. The probability is learnt from crowdsourced annotations, and is tested on two public datasets, comprising 1405 video sequences for which we provide annotations on 90 verbs. We outperform conventional single-label classification by 11% and 6% on the two datasets respectively, and show that learning from annotation probabilities outperforms majority voting and enables discovery of co-occurring labels.
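
The difference between a probabilistic label and a majority-vote label can be seen by turning the verbs chosen by several annotators into a distribution. This is a minimal sketch under assumptions: the verb list, the annotations and the simple relative-frequency estimate are illustrative and are not taken from the paper, which learns the probabilities with a classifier.

```python
from collections import Counter

def verb_probabilities(annotations, vocabulary):
    """Relative frequency of each vocabulary verb among the annotators' choices."""
    counts = Counter(annotations)
    total = sum(counts[verb] for verb in vocabulary)
    if total == 0:
        raise ValueError("no annotation uses a verb from the vocabulary")
    return {verb: counts[verb] / total for verb in vocabulary}

def majority_vote(annotations):
    """Single label: the verb chosen most often (ties resolved by first occurrence)."""
    return Counter(annotations).most_common(1)[0][0]

# Hypothetical crowdsourced verbs for one video segment.
annotations = ["open", "pull", "open", "open", "pull"]
vocabulary = ["open", "pull", "turn"]
soft_label = verb_probabilities(annotations, vocabulary)
hard_label = majority_vote(annotations)
```

The soft label keeps the information that "pull" is also a legitimate description of the segment, which majority voting discards.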

CV · Mar 2, 2017
Towards CNN Map Compression for camera relocalisation

Luis Contreras, Walterio Mayol-Cuevas

This paper presents a study on the use of Convolutional Neural Networks for camera relocalisation and its application to map compression. We follow state of the art visual relocalisation results and evaluate response to different data inputs -- namely, depth, grayscale, RGB, spatial position and combinations of these. We use a CNN map representation and introduce the notion of CNN map compression by using a smaller CNN architecture. We evaluate our proposal in a series of publicly available datasets. This formulation allows us to improve relocalisation accuracy by increasing the number of training trajectories while maintaining a constant-size CNN.

HC · Dec 29, 2016
Automated capture and delivery of assistive task guidance with an eyewear computer: The GlaciAR system

Teesid Leelasawassuk, Dima Damen, Walterio Mayol-Cuevas

In this paper we describe and evaluate a mixed reality system that aims to augment users in task guidance applications by combining automated and unsupervised information collection with minimally invasive video guides. The result is a self-contained system that we call GlaciAR (Glass-enabled Contextual Interactions for Augmented Reality), that operates by extracting contextual interactions from observing users performing actions. GlaciAR is able to i) automatically determine moments of relevance based on a head motion attention model, ii) automatically produce video guidance information, iii) trigger these video guides based on an object detection method, iv) learn without supervision from observing multiple users and v) operate fully on-board a current eyewear computer (Google Glass). We describe the components of GlaciAR together with evaluations on how users are able to use the system to achieve three tasks. We see this work as a first step toward the development of systems that aim to scale up the notoriously difficult authoring problem in guidance systems and where people's natural abilities are enhanced via minimally invasive visual guidance.

CV · Jul 28, 2016
SEMBED: Semantic Embedding of Egocentric Action Videos

Michael Wray, Davide Moltisanti, Walterio Mayol-Cuevas et al.

We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using an unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion and appearance features. We show how SEMBED can interpret a challenging dataset of 1225 freely annotated egocentric videos, outperforming SVM classification by more than 5%.

RO · Jan 18, 2016
Towards an objective evaluation of underactuated gripper designs

Eduardo Ruiz, Walterio Mayol-Cuevas

In this paper we explore state-of-the-art underactuated, compliant robot gripper designs through looking at their performance on a generic grasping task. Starting from a state-of-the-art open gripper design, we propose design modifications, and importantly, evaluate all designs on a grasping experiment involving a selection of objects resulting in 3600 object-gripper interactions. Since we are interested in a design's generic performance rather than in planned grasping, we explore the influence of object shape, pose and orientation relative to the gripper and its finger number and configuration. Using open-loop grasps we achieved up to 75% success rate over our trials. The results indicate that under motion constraints and uncertainties and without involving grasp planning, a 2-fingered underactuated compliant hand outperforms higher multi-fingered configurations. To our knowledge this is the first extended objective comparison of various multi-fingered underactuated hand designs under generic grasping conditions.

CV · Oct 16, 2015
You-Do, I-Learn: Unsupervised Multi-User egocentric Approach Towards Video-Based Guidance

Dima Damen, Teesid Leelasawassuk, Walterio Mayol-Cuevas

This paper presents an unsupervised approach towards automatically extracting video-based guidance on object usage, from egocentric video and wearable gaze tracking, collected from multiple users while performing tasks. The approach i) discovers task relevant objects, ii) builds a model for each, iii) distinguishes different ways in which each discovered object has been used and iv) discovers the dependencies between object interactions. The work investigates using appearance, position, motion and attention, and presents results using each and a combination of relevant features. Moreover, an online scalable approach is presented and is compared to offline results. The paper proposes a method for selecting a suitable video guide to be displayed to a novice user indicating how to use an object, purely triggered by the user's gaze. The potential assistive mode can also recommend an object to be used next based on the learnt sequence of object interactions. The approach was tested on a variety of daily tasks such as initialising a printer, preparing a coffee and setting up a gym machine.