Giovanni Maria Farinella

CV
h-index43
63papers
6,218citations
Novelty36%
AI Score57

63 Papers

CVNov 30, 2023Code
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani et al. · cmu, gatech

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/

CVOct 3, 2022Code
A Multi Camera Unsupervised Domain Adaptation Pipeline for Object Detection in Cultural Sites through Adversarial Learning and Self-Training

Giovanni Pasqualino, Antonino Furnari, Giovanni Maria Farinella

Object detection algorithms allow to enable many interesting applications which can be implemented in different devices, such as smartphones and wearable devices. In the context of a cultural site, implementing these algorithms in a wearable device, such as a pair of smart glasses, allow to enable the use of augmented reality (AR) to show extra information about the artworks and enrich the visitors' experience during their tour. However, object detection algorithms require to be trained on many well annotated examples to achieve reasonable results. This brings a major limitation since the annotation process requires human supervision which makes it expensive in terms of time and costs. A possible solution to reduce these costs consist in exploiting tools to automatically generate synthetic labeled images from a 3D model of the site. However, models trained with synthetic data do not generalize on real images acquired in the target scenario in which they are supposed to be used. Furthermore, object detectors should be able to work with different wearable devices or different mobile devices, which makes generalization even harder. In this paper, we present a new dataset collected in a cultural site to study the problem of domain adaptation for object detection in the presence of multiple unlabeled target domains corresponding to different cameras and a labeled source domain obtained considering synthetic images for training purposes. We present a new domain adaptation method which outperforms current state-of-the-art approaches combining the benefits of aligning the domains at the feature and pixel level with a self-training process. We release the dataset at the following link https://iplab.dmi.unict.it/OBJ-MDA/ and the code of the proposed architecture at https://github.com/fpv-iplab/STMDA-RetinaNet.

CVAug 14, 2023
An Outlook into the Future of Egocentric Vision

Chiara Plizzari, Gabriele Goletto, Antonino Furnari et al.

What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward facing cameras and digital overlays, is expected to be integrated in our every day lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate explorations so as to unlock our path to the future always-on, personalised and life-enhancing egocentric vision.

CVSep 27, 2022
Visual Object Tracking in First Person Vision

Matteo Dunnhofer, Antonino Furnari, Giovanni Maria Farinella et al.

The understanding of human-object interactions is fundamental in First Person Vision (FPV). Visual tracking algorithms which follow the objects manipulated by the camera wearer can provide useful information to effectively model such interactions. In the last years, the computer vision community has significantly improved the performance of tracking algorithms for a large variety of target objects and scenarios. Despite a few previous attempts to exploit trackers in the FPV domain, a methodical analysis of the performance of state-of-the-art trackers is still missing. This research gap raises the question of whether current solutions can be used ``off-the-shelf'' or more domain-specific investigations should be carried out. This paper aims to provide answers to such questions. We present the first systematic investigation of single object tracking in FPV. Our study extensively analyses the performance of 42 algorithms including generic object trackers and baseline FPV-specific trackers. The analysis is carried out by focusing on different aspects of the FPV setting, introducing new performance measures, and in relation to FPV-specific tasks. The study is made possible through the introduction of TREK-150, a novel benchmark dataset composed of 150 densely annotated video sequences. Our results show that object tracking in FPV poses new challenges to current visual trackers. We highlight the factors causing such behavior and point out possible research directions. Despite their difficulties, we prove that trackers bring benefits to FPV downstream tasks requiring short-term object tracking. We expect that generic object tracking will gain popularity in FPV as new and FPV-specific methodologies are investigated.

CVSep 19, 2022
MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain

Francesco Ragusa, Antonino Furnari, Giovanni Maria Farinella

Wearable cameras allow to acquire images and videos from the user's perspective. These data can be processed to understand humans behavior. Despite human behavior analysis has been thoroughly investigated in third person vision, it is still understudied in egocentric settings and in particular in industrial scenarios. To encourage research in this field, we present MECCANO, a multimodal dataset of egocentric videos to study humans behavior understanding in industrial-like settings. The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset. The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first person view, such as recognizing and anticipating human-object interactions. With the MECCANO dataset, we explored five different tasks including 1) Action Recognition, 2) Active Objects Detection and Recognition, 3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and 5) Next-Active Objects Detection. We propose a benchmark aimed to study human behavior in the considered industrial-like scenario which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms. To support research in this field, we publicy release the dataset at https://iplab.dmi.unict.it/MECCANO/.

CVApr 8, 2023
StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation

Francesco Ragusa, Giovanni Maria Farinella, Antonino Furnari

Anticipation problem has been studied considering different aspects such as predicting humans' locations, predicting hands and objects trajectories, and forecasting actions and human-object interactions. In this paper, we studied the short-term object interaction anticipation problem from the egocentric point of view, proposing a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video detecting and localizing next-active objects, predicting the verb which describes the future interaction and determining when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperformed state-of-the-art approaches on the considered task. Our method is ranked first in the public leaderboard of the EGO4D short term object interaction anticipation challenge 2022. Please see the project web page for code and additional details: https://iplab.dmi.unict.it/stillfast/.

CVJun 21, 2023
Exploiting Multimodal Synthetic Data for Egocentric Human-Object Interaction Detection in an Industrial Scenario

Rosario Leonardi, Francesco Ragusa, Antonino Furnari et al.

In this paper, we tackle the problem of Egocentric Human-Object Interaction (EHOI) detection in an industrial setting. To overcome the lack of public datasets in this context, we propose a pipeline and a tool for generating synthetic images of EHOIs paired with several annotations and data signals (e.g., depth maps or segmentation masks). Using the proposed pipeline, we present EgoISM-HOI a new multimodal dataset composed of synthetic EHOI images in an industrial environment with rich annotations of hands and objects. To demonstrate the utility and effectiveness of synthetic EHOI data produced by the proposed tool, we designed a new method that predicts and combines different multimodal signals to detect EHOIs in RGB images. Our study shows that exploiting synthetic data to pre-train the proposed method significantly improves performance when tested on real-world data. Moreover, to fully understand the usefulness of our method, we conducted an in-depth analysis in which we compared and highlighted the superiority of the proposed approach over different state-of-the-art class-agnostic methods. To support research in this field, we publicly release the datasets, source code, and pre-trained models at https://iplab.dmi.unict.it/egoism-hoi.

CVApr 14, 2022
Egocentric Human-Object Interaction Detection Exploiting Synthetic Data

Rosario Leonardi, Francesco Ragusa, Antonino Furnari et al.

We consider the problem of detecting Egocentric HumanObject Interactions (EHOIs) in industrial contexts. Since collecting and labeling large amounts of real images is challenging, we propose a pipeline and a tool to generate photo-realistic synthetic First Person Vision (FPV) images automatically labeled for EHOI detection in a specific industrial scenario. To tackle the problem of EHOI detection, we propose a method that detects the hands, the objects in the scene, and determines which objects are currently involved in an interaction. We compare the performance of our method with a set of state-of-the-art baselines. Results show that using a synthetic dataset improves the performance of an EHOI detection system, especially when few real data are available. To encourage research on this topic, we publicly release the proposed dataset at the following url: https://iplab.dmi.unict.it/EHOI_SYNTH/.

CVMay 29
Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark

Maria Santos-Villafranca, Jesus Bermudez-cameo, Alejandro Perez-Yus et al.

To operate in the physical world, embodied agents must perceive their environment in an "always-on" fashion, selectively accessing the most informative sensors to balance energy constraints and task accuracy. Despite its importance for resource-constrained devices, energy-aware perception remains under-explored, with most prior work assuming unlimited compute. To address this, we introduce Ego-METAS: the first Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark. Ego-METAS provides a unified testbed of more than 100 hours of untrimmed egocentric video from EgoExo4D, CMU-MMAC, and CaptainCook4D, spanning 5 modalities (RGB, audio, gaze, IMU, and monochrome camera). We formulate an online temporal action segmentation task where models must dynamically select which sensors to activate at each timestep while strictly adhering to hardware-representative energy budgets. Alongside the benchmark, we release unified splits, cleaned annotations, pre-extracted features, and a diverse suite of baseline routing policies. Our evaluations show that optimal routing is highly scenario-dependent, and that existing policy-learning methods, designed primarily for trimmed clips, struggle to adapt to continuous, untrimmed environments. However, even simple dynamic fusion of complementary modalities (e.g., via random routing) proves critical for balancing predictive accuracy against strict energy budgets. Ultimately, Ego-METAS provides a standardized foundation to develop robust, cost-aware policies for autonomous, always-on embodied AI.

CVApr 14, 2022
Weakly Supervised Attended Object Detection Using Gaze Data as Annotations

Michele Mazzamuto, Francesco Ragusa, Antonino Furnari et al.

We consider the problem of detecting and recognizing the objects observed by visitors (i.e., attended objects) in cultural sites from egocentric vision. A standard approach to the problem involves detecting all objects and selecting the one which best overlaps with the gaze of the visitor, measured through a gaze tracker. Since labeling large amounts of data to train a standard object detector is expensive in terms of costs and time, we propose a weakly supervised version of the task which leans only on gaze data and a frame-level label indicating the class of the attended object. To study the problem, we present a new dataset composed of egocentric videos and gaze coordinates of subjects visiting a museum. We hence compare three different baselines for weakly supervised attended object detection on the collected data. Results show that the considered approaches achieve satisfactory performance in a weakly supervised manner, which allows for significant time savings with respect to a fully supervised detector based on Faster R-CNN. To encourage research on the topic, we publicly release the code and the dataset at the following url: https://iplab.dmi.unict.it/WS_OBJ_DET/

CVSep 26, 2023
ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto et al.

ENIGMA-51 is a new egocentric dataset acquired in an industrial scenario by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., electric screwdriver) and equipments (e.g., oscilloscope). The 51 egocentric video sequences are densely annotated with a rich set of labels that enable the systematic study of human behavior in the industrial domain. We provide benchmarks on four tasks related to human behavior: 1) untrimmed temporal detection of human-object interactions, 2) egocentric human-object interaction detection, 3) short-term object interaction anticipation and 4) natural language understanding of intents and entities. Baseline results show that the ENIGMA-51 dataset poses a challenging benchmark to study human behavior in industrial scenarios. We publicly release the dataset at https://iplab.dmi.unict.it/ENIGMA-51.

CVJun 29, 2023
Streaming egocentric action anticipation: An evaluation scheme and approach

Antonino Furnari, Giovanni Maria Farinella

Egocentric action anticipation aims to predict the future actions the camera wearer will perform from the observation of the past. While predictions about the future should be available before the predicted events take place, most approaches do not pay attention to the computational time required to make such predictions. As a result, current evaluation schemes assume that predictions are available right after the input video is observed, i.e., presuming a negligible runtime, which may lead to overly optimistic evaluations. We propose a streaming egocentric action evaluation scheme which assumes that predictions are performed online and made available only after the model has processed the current input segment, which depends on its runtime. To evaluate all models considering the same prediction horizon, we hence propose that slower models should base their predictions on temporal segments sampled ahead of time. Based on the observation that model runtime can affect performance in the considered streaming evaluation scenario, we further propose a lightweight action anticipation model based on feed-forward 3D CNNs which is optimized using knowledge distillation techniques with a novel past-to-future distillation loss. Experiments on the three popular datasets EPIC-KITCHENS-55, EPIC-KITCHENS-100 and EGTEA Gaze+ show that (i) the proposed evaluation scheme induces a different ranking on state-of-the-art methods as compared to classic evaluations, (ii) lightweight approaches tend to outmatch more computationally expensive ones, and (iii) the proposed model based on feed-forward 3D CNNs and knowledge distillation outperforms current art in the streaming egocentric action anticipation scenario.

CVApr 4
Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

Daniele Materia, Francesco Ragusa, Giovanni Maria Farinella

The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system in order to guide users during daily life activities and understand their short and long-term goals. Creating systems with such capabilities requires to approach several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task, showing its model-agnostic nature.

CVApr 14, 2022
Panoptic Segmentation using Synthetic and Real Data

Camillo Quattrocchi, Daniele Di Mauro, Antonino Furnari et al.

Being able to understand the relations between the user and the surrounding environment is instrumental to assist users in a worksite. For instance, understanding which objects a user is interacting with from images and video collected through a wearable device can be useful to inform the worker on the usage of specific objects in order to improve productivity and prevent accidents. Despite modern vision systems can rely on advanced algorithms for object detection, semantic and panoptic segmentation, these methods still require large quantities of domain-specific labeled data, which can be difficult to obtain in industrial scenarios. Motivated by this observation, we propose a pipeline which allows to generate synthetic images from 3D models of real environments and real objects. The generated images are automatically labeled and hence effortless to obtain. Exploiting the proposed pipeline, we generate a dataset comprising synthetic images automatically labeled for panoptic segmentation. This set is complemented by a small number of manually labeled real images for fine-tuning. Experiments show that the use of synthetic images allows to drastically reduce the number of real images needed to obtain reasonable panoptic segmentation performance.

CVMar 10
ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto et al.

Understanding human behavior from complementary egocentric (ego) and exocentric (exo) points of view enables the development of systems that can support workers in industrial environments and enhance their safety. However, progress in this area is hindered by the lack of datasets capturing both views in realistic industrial scenarios. To address this gap, we propose ENIGMA-360, a new ego-exo dataset acquired in a real industrial scenario. The dataset is composed of 180 egocentric and 180 exocentric procedural videos temporally synchronized offering complementary information of the same scene. The 360 videos have been labeled with temporal and spatial annotations, enabling the study of different aspects of human behavior in industrial domain. We provide baseline experiments for 3 foundational tasks for human behavior understanding: 1) Temporal Action Segmentation, 2) Keystep Recognition and 3) Egocentric Human-Object Interaction Detection, showing the limits of state-of-the-art approaches on this challenging scenario. These results highlight the need for new models capable of robust ego-exo understanding in real-world environments. We publicly release the dataset and its annotations at https://fpv-iplab.github.io/ENIGMA-360/.

CVDec 16, 2025
SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition

Alessia Micieli, Giovanni Maria Farinella, Francesco Ragusa

In this work we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models showing how temporal information, 2D keypoints and RGB frames can be influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.

CVDec 15, 2025
Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance

Francesco Ragusa, Michele Mazzamuto, Rosario Forte et al.

We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a ``Wizard of OZ'' data collection paradigm, the expert enacts a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmark of egocentric video-language assistants: https://fpv-iplab.github.io/Ego-EXTRA/.

CVJan 14
GlovEgo-HOI: Bridging the Synthetic-to-Real Gap for Industrial Egocentric Human-Object Interaction Detection

Alfio Spoto, Rosario Leonardi, Francesco Ragusa et al.

Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint- Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: GitHub project.

CVJan 28
ProSkill: Segment-Level Skill Assessment in Procedural Videos

Michele Mazzamuto, Daniele Di Mauro, Gianpiero Francesca et al.

Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions, focus on either pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs needs improvement). In response to these shortcomings, we introduce ProSkill, the first benchmark dataset for action-level skill assessment in procedural tasks. ProSkill provides absolute skill assessment annotations, along with pairwise ones. This is enabled by a novel and scalable annotation protocol that allows for the creation of an absolute skill assessment ranking starting from pairwise assessments. This protocol leverages a Swiss Tournament scheme for efficient pairwise comparisons, which are then aggregated into consistent, continuous global scores using an ELO-based rating system. We use our dataset to benchmark the main state-of-the-art skill assessment algorithms, including both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state-of-the-art highlight the challenges and thus the value of ProSkill in the context of skill assessment for procedural videos. All data and code are available at https://fpv-iplab.github.io/ProSkill/

CVJul 5, 2024
ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA

Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero-Campo et al.

Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-input video pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our results obtain a final 33.5 N mAP, 17.25 N+V mAP, 11.77 N+δ mAP and 6.75 Overall top-5 mAP metric when trained on the v2 training dataset.

CVDec 5, 2023Code
Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

Camillo Quattrocchi, Antonino Furnari, Daniele Di Mauro et al.

We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, where wearable cameras capture video data. The conventional supervised approach requires the collection and labeling of a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology which performs the adaptation leveraging existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate both at the feature and Temporal Action Segmentation model level. Experiments on Assembly101 and EgoExo4D demonstrate the effectiveness of the proposed method against classic unsupervised domain adaptation and temporal alignment approaches. Without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving a +15.99 improvement in the edit score (28.59 vs 12.60) on the Assembly101 dataset compared to a baseline model trained solely on exocentric data. In similar settings, our method also improves edit score by +3.32 on the challenging EgoExo4D benchmark. Code is available here: https://github.com/fpv-iplab/synchronization-is-all-you-need.

CVMay 18
EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

Rosario Leonardi, Francesco Ragusa, Daniele Materia et al.

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

CVFeb 25, 2025Code
Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos

Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari

We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving +14.5%, +10.2%, and +13.6% F1-score improvements. Our feature-based approach for predicting task graphs from textual/video embeddings demonstrates emerging video understanding abilities. We also achieved top performance on the procedure understanding benchmark on Ego-Exo4D and significantly improved online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

CVJun 3, 2024Code
Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos

Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari

Procedural activities are sequences of key-steps aimed at achieving specific goals. They are crucial to build intelligent agents able to assist users effectively. In this context, task graphs have emerged as a human-understandable representation of procedural activities, encoding a partial ordering over the key-steps. While previous works generally relied on hand-crafted procedures to extract task graphs from videos, in this paper, we propose an approach based on direct maximum likelihood optimization of edges' weights, which allows gradient-based learning of task graphs and can be naturally plugged into neural network architectures. Experiments on the CaptainCook4D dataset demonstrate the ability of our approach to predict accurate task graphs from the observation of action sequences, with an improvement of +16.7% over previous approaches. Owing to the differentiability of the proposed framework, we also introduce a feature-based approach, aiming to predict task graphs from key-step textual or video embeddings, for which we observe emerging video understanding abilities. Task graphs learned with our approach are also shown to significantly enhance online mistake detection in procedural egocentric videos, achieving notable gains of +19.8% and +7.5% on the Assembly101-O and EPIC-Tent-O datasets. Code for replicating experiments is available at https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

CVAug 4, 2020Code
An Unsupervised Domain Adaptation Scheme for Single-Stage Artwork Recognition in Cultural Sites

Giovanni Pasqualino, Antonino Furnari, Giovanni Signorello et al.

Recognizing artworks in a cultural site using images acquired from the user's point of view (First Person Vision) allows to build interesting applications for both the visitors and the site managers. However, current object detection algorithms working in fully supervised settings need to be trained with large quantities of labeled data, whose collection requires a lot of times and high costs in order to achieve good performance. Using synthetic data generated from the 3D model of the cultural site to train the algorithms can reduce these costs. On the other hand, when these models are tested with real images, a significant drop in performance is observed due to the differences between real and synthetic images. In this study we consider the problem of Unsupervised Domain Adaptation for object detection in cultural sites. To address this problem, we created a new dataset containing both synthetic and real images of 16 different artworks. We hence investigated different domain adaptation techniques based on one-stage and two-stage object detector, image-to-image translation and feature alignment. Based on the observation that single-stage detectors are more robust to the domain shift in the considered settings, we proposed a new method which builds on RetinaNet and feature alignment that we called DA-RetinaNet. The proposed approach achieves better results than compared methods on the proposed dataset and on Cityscapes. To support research in this field we release the dataset at the following link https://iplab.dmi.unict.it/EGO-CH-OBJ-UDA/ and the code of the proposed architecture at https://github.com/fpv-iplab/DA-RetinaNet.

CVJun 18, 2020Code
SceneAdapt: Scene-based domain adaptation for semantic segmentation using adversarial learning

Daniele Di Mauro, Antonino Furnari, Giuseppe Patanè et al.

Semantic segmentation methods have achieved outstanding performance thanks to deep learning. Nevertheless, when such algorithms are deployed to new contexts not seen during training, it is necessary to collect and label scene-specific data in order to adapt them to the new domain using fine-tuning. This process is required whenever an already installed camera is moved or a new camera is introduced in a camera network due to the different scene layouts induced by the different viewpoints. To limit the amount of additional training data to be collected, it would be ideal to train a semantic segmentation method using labeled data already available and only unlabeled data coming from the new camera. We formalize this problem as a domain adaptation task and introduce a novel dataset of urban scenes with the related semantic labels. As a first approach to address this challenging task, we propose SceneAdapt, a method for scene adaptation of semantic segmentation algorithms based on adversarial learning. Experiments and comparisons with state-of-the-art approaches to domain adaptation highlight that promising performance can be achieved using adversarial learning both when the two scenes have different but points of view, and when they comprise images of completely different scenes. To encourage research on this topic, we made our code available at our web page: https://iplab.dmi.unict.it/ParkSmartSceneAdaptation/.

CVMay 22, 2019Code
What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention

Antonino Furnari, Giovanni Maria Farinella

Egocentric action anticipation consists in understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past, and 2) formulate predictions about the future. The input video is processed considering three complimentary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-Kitchens dataset including more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. Our method is ranked first in the public leaderboard of the EPIC-Kitchens egocentric action anticipation challenge 2019. Please see our web pages for code and examples: http://iplab.dmi.unict.it/rulstm - https://github.com/fpv-iplab/rulstm.

CVDec 6, 2023
Action Scene Graphs for Long-Form Understanding of Egocentric Videos

Ivan Rodin, Antonino Furnari, Kyle Min et al.

We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs offering a rich set of annotations designed for long-from egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, egocentric action anticipation and egocentric activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and the code to replicate experiments and annotations.

CVApr 2, 2024
PREGO: online mistake detection in PRocedural EGOcentric videos

Alessandro Flaborea, Guido Maria D'Amely di Melendugno, Leonardo Plini et al.

Promptly identifying procedural errors from egocentric videos in an online setting is highly challenging and valuable for detecting mistakes as soon as they happen. This capability has a wide range of applications across various fields, such as manufacturing and healthcare. The nature of procedural mistakes is open-set since novel types of failures might occur, which calls for one-class classifiers trained on correctly executed procedures. However, no technique can currently detect open-set procedural mistakes online. We propose PREGO, the first online one-class classification model for mistake detection in PRocedural EGOcentric videos. PREGO is based on an online action recognition component to model the current action, and a symbolic reasoning module to predict the next actions. Mistake detection is performed by comparing the recognized current action with the expected future one. We evaluate PREGO on two procedural egocentric video datasets, Assembly101 and Epic-tent, which we adapt for online benchmarking of procedural mistake detection to establish suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, respectively.

CVNov 4, 2024
TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Leonardo Plini, Luca Scofano, Edoardo De Matteis et al.

Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch, specifically, leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method reveals its robustness and effectiveness in online applications.

CVDec 5, 2023
Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?

Rosario Leonardi, Antonino Furnari, Francesco Ragusa et al.

In this study, we investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction detection. Via extensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, our findings reveal how to exploit synthetic data for the HOI detection task when real labeled data are scarce or unavailable. Specifically, by leveraging only 10% of real labeled data, we achieve improvements in Overall AP compared to baselines trained exclusively on real data of: +5.67% on EPIC-KITCHENS VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Our analysis is supported by a novel data generation pipeline and the newly introduced HOI-Synth benchmark which augments existing datasets with synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Data, code, and data generation tools to support future research are released at: https://fpv-iplab.github.io/HOI-Synth/.

CVNov 25, 2024
Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari et al.

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need of applied research on these components.

CVJul 16, 2025
Calisthenics Skills Temporal Video Segmentation

Antonio Finocchiaro, Giovanni Maria Farinella, Antonino Furnari

Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.

CVMar 31
Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection

Rosario Leonardi, Antonino Furnari, Francesco Ragusa et al.

In this work, we explore the role of synthetic data in improving the detection of Hand-Object Interactions from egocentric images. Through extensive experimentation and comparative analysis on VISOR, EgoHOS, and ENIGMA-51 datasets, our findings demonstrate the potential of synthetic data to significantly improve HOI detection, particularly when real labeled data are scarce or unavailable. By using synthetic data and only 10% of the real labeled data, we achieve improvements in Overall AP over models trained exclusively on real data, with gains of +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Furthermore, we systematically study how aligning synthetic data to specific real-world benchmarks with respect to objects, grasps, and environments, showing that the effectiveness of synthetic data consistently improves with better synthetic-real alignment. As a result of this work, we release a new data generation pipeline and the new HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interaction. These data are automatically annotated with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. All data, code, and tools for synthetic data generation are available at: https://fpv-iplab.github.io/HOI-Synth/.

CVSep 30, 2025
Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations

Nicola Messina, Rosario Leonardi, Luca Ciampi et al.

Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations -- natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., "I am pouring vegetables from the chopping board to the pan"). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.

CVJul 22, 2025
Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video

Alessandro Sebastiano Catinello, Giovanni Maria Farinella, Antonino Furnari

This work tackles the problem of Online detection of Take and Release (OTR) of an object in untrimmed egocentric videos. This task is challenging due to severe label imbalance, with temporally sparse positive annotations, and the need for precise temporal predictions. Furthermore, methods need to be computationally efficient in order to be deployed in real-world online settings. To address these challenges, we propose Mamba-OTR, a model based on the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. To address label imbalance, our training pipeline incorporates the focal loss and a novel regularization scheme that aligns model predictions with the evaluation metric. Extensive experiments on EPIC-KITCHENS-100, the comparisons with transformer-based approach, and the evaluation of different training and test schemes demonstrate the superiority of Mamba-OTR in both accuracy and efficiency. These finding are particularly evident when evaluating full-length videos or high frame-rate sequences, even when trained on short video snippets for computational convenience. The proposed Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a vanilla Mamba, thus providing a strong baseline for OTR. We will publicly release the source code of Mamba-OTR to support future research.

CVJul 17, 2025
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

Antonio Finocchiaro, Alessandro Sebastiano Catinello, Michele Mazzamuto et al.

Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-objects interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.

CVJul 16, 2025
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation

Antonio Finocchiaro, Giovanni Maria Farinella, Antonino Furnari

Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional methods for calisthenics skill recognition are based on pose estimation methods to determine the position of skeletal data from images, which is later fed to a classification algorithm to infer the performed skill. Despite the progress in human pose estimation algorithms, they still involve high computational costs, long inference times, and complex setups, which limit the applicability of such approaches in real-time applications or mobile devices. This work proposes a direct approach to calisthenics skill recognition, which leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows for flexible replacement of components, enabling future enhancements and adaptation to real-world applications.

CVJun 19, 2025
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?

Giuseppe Lando, Rosario Forte, Giovanni Maria Farinella et al.

We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB per minute storage, matching the performance of dedicated state-of-the-art systems while being 10**4/10**5 times more memory-efficient. Extensive ablations provides insights into the role of each component and design choice, and highlight directions of improvement for future research.

CVNov 23, 2024
Semantically Guided Action Anticipation

Anxhelo Diko, Antonino Furnari, Luigi Cinque et al.

Unsupervised domain adaptation remains a critical challenge in enabling the knowledge transfer of models across unseen domains. Existing methods struggle to balance the need for domain-invariant representations with preserving domain-specific features, which is often due to alignment approaches that impose the projection of samples with similar semantics close in the latent space despite their drastic domain differences. We introduce a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. Our method defines a domain-agnostic structure upon the semantic/geometric relationships between class labels in language space and guides adaptation, ensuring that the organization of samples in visual space reflects reference inter-class relationships while preserving domain-specific characteristics. We empirically demonstrate our method's superiority in domain adaptation tasks across four diverse image and video datasets. Remarkably, we surpass previous works in 18 different adaptation scenarios across four diverse image and video datasets with average accuracy improvements of +3.32% on DomainNet, +5.75% in GeoPlaces, +4.77% on GeoImnet, and +1.94% mean class accuracy improvement on EgoExo4D.

CVJun 12, 2024
Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities

Michele Mazzamuto, Antonino Furnari, Yoichi Sato et al.

We address the challenge of unsupervised mistake detection in egocentric video of skilled human activities through the analysis of gaze signals. While traditional methods rely on manually labeled mistakes, our approach does not require mistake annotations, hence overcoming the need of domain-specific labeled data. Based on the observation that eye movements closely follow object manipulation activities, we assess to what extent eye-gaze signals can support mistake detection, proposing to identify deviations in attention patterns measured through a gaze tracker with respect to those estimated by a gaze prediction model. Since predicting gaze in video is characterized by high uncertainty, we propose a novel gaze completion task, where eye fixations are predicted from visual observations and partial gaze trajectories, and contribute a novel gaze completion approach which explicitly models correlations between gaze information and local visual tokens. Inconsistencies between predicted and observed gaze trajectories act as an indicator to identify mistakes. Experiments highlight the effectiveness of the proposed approach in different settings, with relative gains up to +14%, +11%, and +5% in EPIC-Tent, HoloAssist and IndustReal respectively, remarkably matching results of supervised approaches without seeing any labels. We further show that gaze-based analysis is particularly useful in the presence of skilled actions, low action execution confidence, and actions requiring hand-eye coordination and object manipulation skills. Our method is ranked first on the HoloAssist Mistake Detection challenge.

CVJun 3, 2024
AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Josechu Guerrero et al.

Short-Term object-interaction Anticipation consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants or human robot interaction to understand the user goals, but there is still room for improvement to perform STA in a precise and reliable way. In this work, we improve the performance of STA predictions with two contributions: 1. We propose STAformer, a novel attention-based architecture integrating frame guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair. 2. We introduce two novel modules to ground STA predictions on human behavior by modeling affordances.First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant relative Overall Top-5 mAP improvements of up to +45% on Ego4D and +42% on a novel set of curated EPIC-Kitchens STA labels. We will release the code, annotations, and pre extracted affordances on Ego4D and EPIC- Kitchens to encourage future research in this area.

LGMay 4, 2024
PhilHumans: Benchmarking Machine Learning for Personal Health

Vadim Liventsev, Vivek Kumar, Allmin Pradhap Singh Susaiyah et al.

The use of machine learning in Healthcare has the potential to improve patient outcomes as well as broaden the reach and affordability of Healthcare. The history of other application areas indicates that strong benchmarks are essential for the development of intelligent systems. We present Personal Health Interfaces Leveraging HUman-MAchine Natural interactions (PhilHumans), a holistic suite of benchmarks for machine learning across different Healthcare settings - talk therapy, diet coaching, emergency care, intensive care, obstetric sonography - as well as different learning settings, such as action anticipation, timeseries modeling, insight mining, language modeling, computer vision, reinforcement learning and program synthesis

CVFeb 8, 2022
Untrimmed Action Anticipation

Ivan Rodin, Antonino Furnari, Dimitrios Mavroeidis et al.

Egocentric action anticipation consists in predicting a future action the camera wearer will perform from egocentric video. While the task has recently attracted the attention of the research community, current approaches assume that the input videos are "trimmed", meaning that a short video sequence is sampled a fixed time before the beginning of the action. We argue that, despite the recent advances in the field, trimmed action anticipation has a limited applicability in real-world scenarios where it is important to deal with "untrimmed" video inputs and it cannot be assumed that the exact moment in which the action will begin is known at test time. To overcome such limitations, we propose an untrimmed action anticipation task, which, similarly to temporal action detection, assumes that the input video is untrimmed at test time, while still requiring predictions to be made before the actions actually take place. We design an evaluation procedure for methods designed to address this novel task, and compare several baselines on the EPIC-KITCHENS-100 dataset. Experiments show that the performance of current models designed for trimmed action anticipation is very limited and more research on this task is required.

ROFeb 2, 2022
Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation

Marco Rosano, Antonino Furnari, Luigi Gulino et al.

Navigating complex indoor environments requires a deep understanding of the space the robotic agent is acting into to correctly inform the navigation process of the agent towards the goal location. In recent learning-based navigation approaches, the scene understanding and navigation abilities of the agent are achieved simultaneously by collecting the required experience in simulation. Unfortunately, even if simulators represent an efficient tool to train navigation policies, the resulting models often fail when transferred into the real world. One possible solution is to provide the navigation model with mid-level visual representations containing important domain-invariant properties of the scene. But, what are the best representations that facilitate the transfer of a model to the real-world? How can they be combined? In this work we address these issues by proposing a benchmark of Deep Learning architectures to combine a range of mid-level visual representations, to perform a PointGoal navigation task following a Reinforcement Learning setup. All the proposed navigation models have been trained with the Habitat simulator on a synthetic office environment and have been tested on the same real-world environment using a real robotic platform. To efficiently assess their performance in a real context, a validation tool has been proposed to generate realistic navigation episodes inside the simulator. Our experiments showed that navigation models can benefit from the multi-modal input and that our validation tool can provide good estimation of the expected navigation performance in the real world, while saving time and resources. The acquired synthetic and real 3D models of the environment, together with the code of our validation tool built on top of Habitat, are publicly available at the following link: https://iplab.dmi.unict.it/EmbodiedVN/

CVOct 13, 2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video

Kristen Grauman, Andrew Westbury, Eugene Byrne et al.

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

CVOct 11, 2021
Towards Streaming Egocentric Action Anticipation

Antonino Furnari, Giovanni Maria Farinella

Egocentric action anticipation is the task of predicting the future actions a camera wearer will likely perform based on past video observations. While in a real-world system it is fundamental to output such predictions before the action begins, past works have not generally paid attention to model runtime during evaluation. Indeed, current evaluation schemes assume that predictions can be made offline, and hence that computational resources are not limited. In contrast, in this paper, we propose a "streaming" egocentric action anticipation evaluation protocol which explicitly considers model runtime for performance assessment, assuming that predictions will be available only after the current video segment is processed, which depends on the processing time of a method. Following the proposed evaluation scheme, we benchmark different state-of-the-art approaches for egocentric action anticipation on two popular datasets. Our analysis shows that models with a smaller runtime tend to outperform heavier models in the considered streaming scenario, thus changing the rankings generally observed in standard offline evaluations. Based on this observation, we propose a lightweight action anticipation model consisting in a simple feed-forward 3D CNN, which we propose to optimize using knowledge distillation techniques and a custom loss. The results show that the proposed approach outperforms prior art in the streaming scenario, also in combination with other lightweight models.

CVAug 31, 2021
Is First Person Vision Challenging for Object Tracking?

Matteo Dunnhofer, Antonino Furnari, Giovanni Maria Farinella et al.

Understanding human-object interactions is fundamental in First Person Vision (FPV). Tracking algorithms which follow the objects manipulated by the camera wearer can provide useful cues to effectively model such interactions. Visual tracking solutions available in the computer vision literature have significantly improved their performance in the last years for a large variety of target objects and tracking scenarios. However, despite a few previous attempts to exploit trackers in FPV applications, a methodical analysis of the performance of state-of-the-art trackers in this domain is still missing. In this paper, we fill the gap by presenting the first systematic study of object tracking in FPV. Our study extensively analyses the performance of recent visual trackers and baseline FPV trackers with respect to different aspects and considering a new performance measure. This is achieved through TREK-150, a novel benchmark dataset composed of 150 densely annotated video sequences. Our results show that object tracking in FPV is challenging, which suggests that more research efforts should be devoted to this problem so that tracking could benefit FPV tasks.

CVJul 28, 2021
Predicting the Future from First Person (Egocentric) Vision: A Survey

Ivan Rodin, Antonino Furnari, Dimitrios Mavroedis et al.

Egocentric videos can bring a lot of information about how humans perceive the world and interact with the environment, which can be beneficial for the analysis of human behaviour. The research in egocentric video analysis is developing rapidly thanks to the increasing availability of wearable devices and the opportunities offered by new large-scale egocentric datasets. As computer vision techniques continue to develop at an increasing pace, the tasks related to the prediction of future are starting to evolve from the need of understanding the present. Predicting future human activities, trajectories and interactions with objects is crucial in applications such as human-robot interaction, assistive wearable technologies for both industrial and daily living scenarios, entertainment and virtual or augmented reality. This survey summarises the evolution of studies in the context of future prediction from egocentric vision making an overview of applications, devices, existing problems, commonly used datasets, models and input modalities. Our analysis highlights that methods for future prediction from egocentric vision can have a significant impact in a range of applications and that further research efforts should be devoted to the standardisation of tasks and the proposal of datasets considering real-world scenarios such as the ones with an industrial vocation.

ROJun 22, 2021
A Survey on Human-aware Robot Navigation

Ronja Möller, Antonino Furnari, Sebastiano Battiato et al.

Intelligent systems are increasingly part of our everyday lives and have been integrated seamlessly to the point where it is difficult to imagine a world without them. Physical manifestations of those systems on the other hand, in the form of embodied agents or robots, have so far been used only for specific applications and are often limited to functional roles (e.g. in the industry, entertainment and military fields). Given the current growth and innovation in the research communities concerned with the topics of robot navigation, human-robot-interaction and human activity recognition, it seems like this might soon change. Robots are increasingly easy to obtain and use and the acceptance of them in general is growing. However, the design of a socially compliant robot that can function as a companion needs to take various areas of research into account. This paper is concerned with the navigation aspect of a socially-compliant robot and provides a survey of existing solutions for the relevant areas of research as well as an outlook on possible future directions.