Angelo Cangelosi

RO
h-index: 48
42 papers
1,320 citations
Novelty: 38%
AI Score: 52

42 Papers

CL Mar 11, 2022
WLASL-LEX: a Dataset for Recognising Phonological Properties in American Sign Language

Federico Tavella, Viktor Schlegel, Marta Romeo et al.

Signed Language Processing (SLP) concerns the automated processing of signed languages, the main means of communication of Deaf and hearing impaired individuals. SLP features many different tasks, ranging from sign recognition to translation and production of signed speech, but has been overlooked by the NLP community thus far. In this paper, we bring to attention the task of modelling the phonology of sign languages. We leverage existing resources to construct a large-scale dataset of American Sign Language signs annotated with six different phonological properties. We then conduct an extensive empirical study to investigate whether data-driven end-to-end and feature-based approaches can be optimised to automatically recognise these properties. We find that, despite the inherent challenges of the task, graph-based neural networks that operate over skeleton features extracted from raw videos are able to succeed at the task to a varying degree. Most importantly, we show that this performance holds even on signs unobserved during training.

RO Sep 1, 2022
CASPER: Cognitive Architecture for Social Perception and Engagement in Robots

Samuele Vinanzi, Angelo Cangelosi

Our world is being increasingly pervaded by intelligent robots with varying degrees of autonomy. To seamlessly integrate themselves in our society, these machines should possess the ability to navigate the complexities of our daily routines even in the absence of a human's direct input. In other words, we want these robots to understand the intentions of their partners with the purpose of predicting the best way to help them. In this paper, we present CASPER (Cognitive Architecture for Social Perception and Engagement in Robots): a symbolic cognitive architecture that uses qualitative spatial reasoning to anticipate the pursued goal of another agent and to calculate the best collaborative behavior. This is performed through an ensemble of parallel processes that model a low-level action recognition and a high-level goal understanding, both of which are formally verified. We have tested this architecture in a simulated kitchen environment and the results we have collected show that the robot is able to both recognize an ongoing goal and to properly collaborate towards its achievement. This demonstrates a new use of Qualitative Spatial Relations applied to the problem of intention reading in the domain of human-robot interaction.

LG Aug 21, 2023
To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

Carlo Mazzola, Marta Romeo, Francesco Rea et al.

Communicating shapes our social world. For a robot to be considered social and consequently be integrated in our social environment, it is fundamental to understand some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim to develop a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view.

LG Feb 11, 2023
Towards Multi-User Activity Recognition through Facilitated Training Data and Deep Learning for Human-Robot Collaboration Applications

Francesco Semeraro, Jon Carberry, Angelo Cangelosi

Human-robot interaction (HRI) research is progressively addressing multi-party scenarios, where a robot interacts with more than one human user at the same time. Conversely, research is still at an early stage for human-robot collaboration (HRC). The use of machine learning techniques to handle this type of collaboration requires data that are less feasible to produce than in a typical HRC setup. This work outlines scenarios of concurrent tasks for non-dyadic HRC applications. Based upon these concepts, this study also proposes an alternative way of gathering data regarding multi-user activity, by collecting data related to single users and merging them in post-processing, to reduce the effort involved in producing recordings of pair settings. To validate this statement, 3D skeleton poses of activity of single users were collected and merged in pairs. After this, these datapoints were used to separately train a long short-term memory (LSTM) network and a variational autoencoder (VAE) composed of spatio-temporal graph convolutional networks (STGCN) to recognise the joint activities of the pairs of people. The results showed that it is possible to make use of data collected in this way for pair HRC settings and get similar performances compared to using training data regarding groups of users recorded under the same settings, which relieves the technical difficulties involved in producing these data. The related code and collected data are publicly available.
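
The idea of merging single-user recordings into pair samples can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not describe the merging procedure, so the function name `merge_pairs`, the truncation to the shorter sequence, the frame-wise concatenation, and the tuple-valued joint label are all hypothetical choices.

```python
from itertools import product

def merge_pairs(user_a, user_b):
    """Pair every single-user recording of A with every recording of B.

    Each recording is (label, frames), where frames is a list of per-frame
    flattened joint coordinates. Sequences are truncated to the shorter
    length and concatenated frame by frame (an assumed merging rule).
    """
    merged = []
    for (label_a, frames_a), (label_b, frames_b) in product(user_a, user_b):
        n = min(len(frames_a), len(frames_b))
        frames = [frames_a[t] + frames_b[t] for t in range(n)]
        merged.append(((label_a, label_b), frames))
    return merged

# Toy data: 2 joints x (x, y, z) flattened gives 6 values per frame.
user_a = [("pick", [[0.0] * 6] * 4), ("place", [[1.0] * 6] * 3)]
user_b = [("idle", [[2.0] * 6] * 5)]
pairs = merge_pairs(user_a, user_b)
```

Each merged sample then carries both users' skeletons in one frame vector, which is the kind of input a sequence model such as an LSTM could consume.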

RO Sep 12, 2022
Signs of Language: Embodied Sign Language Fingerspelling Acquisition from Demonstrations for Human-Robot Interaction

Federico Tavella, Aphrodite Galata, Angelo Cangelosi

Learning fine-grained movements is a challenging topic in robotics, particularly in the context of robotic hands. One specific instance of this challenge is the acquisition of fingerspelling sign language in robots. In this paper, we propose an approach for learning dexterous motor imitation from video examples without additional information. To achieve this, we first build a URDF model of a robotic hand with a single actuator for each joint. We then leverage pre-trained deep vision models to extract the 3D pose of the hand from RGB videos. Next, using state-of-the-art reinforcement learning algorithms for motion imitation (namely, proximal policy optimization and soft actor-critic), we train a policy to reproduce the movement extracted from the demonstrations. We identify the optimal set of hyperparameters for imitation based on a reference motion. Finally, we demonstrate the generalizability of our approach by testing it on six different tasks, corresponding to fingerspelled letters. Our results show that our approach is able to successfully imitate these fine-grained movements without additional information, highlighting its potential for real-world applications in robotics.

RO Jul 10, 2023
Proceeding of the 1st Workshop on Social Robots Personalisation At the crossroads between engineering and humanities (CONCATENATE)

Imene Tarakli, Georgios Angelopoulos, Mehdi Hellou et al.

Nowadays, robots are expected to interact more physically, cognitively, and socially with people. They should adapt to unpredictable contexts alongside individuals with various behaviours. For this reason, personalisation is a valuable attribute for social robots as it allows them to act according to a specific user's needs and preferences and achieve natural and transparent robot behaviours for humans. If correctly implemented, personalisation could also be the key to the large-scale adoption of social robotics. However, achieving personalisation is arduous as it requires us to expand the boundaries of robotics by taking advantage of the expertise of various domains. Indeed, personalised robots need to analyse and model user interactions while considering their involvement in the adaptive process. It also requires us to address ethical and socio-cultural aspects of personalised HRI to achieve inclusive and diverse interaction and avoid deception and misplaced trust when interacting with the users. At the same time, policymakers need to ensure regulations in view of possible short-term and long-term adaptive HRI. This workshop aims to raise an interdisciplinary discussion on personalisation in robotics. It aims at bringing researchers from different fields together to propose guidelines for personalisation while addressing the following questions: how to define it, how to achieve it, and how it should be guided to fit legal and ethical requirements.

18.7 RO Apr 13
Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot

Zhegong Shangguan, Alessandro Di Nuovo, Angelo Cangelosi

Robots are increasingly entering human-interactive scenarios that require understanding of quantity. How intelligent systems acquire abstract numerical concepts from sensorimotor experience remains a fundamental challenge in cognitive science and artificial intelligence. Here we investigate embodied numerical learning using a neural network model trained to perform sequential counting through naturalistic robotic interaction with a Franka Panda manipulator. We demonstrate that embodied models achieve 96.8% counting accuracy with only 10% of training data, compared to 60.6% for vision-only baselines. This advantage persists when visual-motor correspondences are randomized, indicating that embodiment functions as a structural prior that regularizes learning rather than as an information source. The model spontaneously develops biologically plausible representations: number-selective units with logarithmic tuning, mental number line organization, Weber-law scaling, and rotational dynamics encoding numerical magnitude (r = 0.97, slope = 30.6° per count). The learning trajectory parallels children's developmental progression from subset-knowers to cardinal-principle knowers. These findings demonstrate that minimal embodiment can ground abstract concepts, improve data efficiency, and yield interpretable representations aligned with biological cognition, which may contribute to embodied mathematics tutoring and safety-critical industrial applications.

RO Nov 7, 2023
ToP-ToM: Trust-aware Robot Policy with Theory of Mind

Chuang Yu, Baris Serhan, Angelo Cangelosi

Theory of Mind (ToM) is a fundamental cognitive architecture that endows humans with the ability to attribute mental states to others. Humans infer the desires, beliefs, and intentions of others by observing their behavior and, in turn, adjust their actions to facilitate better interpersonal communication and team collaboration. In this paper, we investigated trust-aware robot policy with the theory of mind in a multiagent setting where a human collaborates with a robot against another human opponent. We show that by only focusing on team performance, the robot may resort to the reverse psychology trick, which poses a significant threat to trust maintenance. The human's trust in the robot will collapse when they discover deceptive behavior by the robot. To mitigate this problem, we adopt the robot theory of mind model to infer the human's trust beliefs, including true belief and false belief (an essential element of ToM). We designed a dynamic trust-aware reward function based on different trust beliefs to guide the robot policy learning, which aims to balance team performance against avoiding human trust collapse due to robot reverse psychology. The experimental results demonstrate the importance of the ToM-based robot policy for human-robot trust and the effectiveness of our ToM-based robot policy in multiagent interaction settings.

49.0 CV Apr 2
Hierarchical, Interpretable, Label-Free Concept Bottleneck Model

Haodong Xie, Yujun Cai, Rahul Singh Maharjan et al.

Concept Bottleneck Models (CBMs) introduce interpretability to black-box deep learning models by predicting labels through human-understandable concepts. However, unlike humans, who identify objects at different levels of abstraction using both general and specific features, existing CBMs operate at a single semantic level in both concept and label space. We propose HIL-CBM, a Hierarchical Interpretable Label-Free Concept Bottleneck Model that extends CBMs into a hierarchical framework to enhance interpretability by more closely mirroring the human cognitive process. HIL-CBM enables classification and explanation across multiple semantic levels without requiring relational concept annotations. HIL-CBM aligns the abstraction level of concept-based explanations with that of model predictions, progressing from abstract to concrete. This is achieved by (i) introducing a gradient-based visual consistency loss that encourages abstraction layers to focus on similar spatial regions, and (ii) training dual classification heads, each operating on feature concepts at different abstraction levels. Experiments on benchmark datasets demonstrate that HIL-CBM outperforms state-of-the-art sparse CBMs in classification accuracy. Human evaluations further show that HIL-CBM provides more interpretable and accurate explanations, while maintaining a hierarchical and label-free approach to feature concepts.

75.0 RO Apr 7
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

Theodor Wulff, Federico Tavella, Rahul Singh Maharjan et al.

Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark dataset of human language-annotated trajectories, and provide critical insights into multimodal grounding representations, all while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizing the need for costly data annotations.

LG Oct 7, 2023
LIPEx-Locally Interpretable Probabilistic Explanations-To Look Beyond The True Class

Hongbo Zhu, Angelo Cangelosi, Procheta Sen et al.

In this work, we instantiate a novel perturbation-based multi-class explanation framework, LIPEx (Locally Interpretable Probabilistic Explanation). We demonstrate that LIPEx not only locally replicates the probability distributions output by the widely used complex classification models but also provides insight into how every feature deemed to be important affects the prediction probability for each of the possible classes. We achieve this by defining the explanation as a matrix obtained via regression with respect to the Hellinger distance in the space of probability distributions. Ablation tests on text and image data show that LIPEx-guided removal of important features from the data causes more change in predictions for the underlying model than similar tests based on other saliency-based or feature importance-based Explainable AI (XAI) methods. It is also shown that, compared to LIME, LIPEx is more data efficient in that it uses fewer perturbations of the data to obtain a reliable explanation. This data-efficiency is seen to manifest as LIPEx being able to compute its explanation matrix around 53% faster than all-class LIME, for classification experiments with text data.
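
For readers unfamiliar with the Hellinger distance mentioned above, a minimal sketch of the standard definition for discrete distributions follows. It shows only the distance itself, not the LIPEx regression; the function name `hellinger` is chosen for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

# Identical distributions are at distance 0; disjoint ones are at distance 1.
same = hellinger([0.2, 0.8], [0.2, 0.8])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
```

Because the distance is bounded and symmetric, it is a natural way to compare a surrogate's predicted class distribution with that of the model being explained.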

CV Jul 8, 2024
Noise-Free Explanation for Driving Action Prediction

Hongbo Zhu, Theodor Wulff, Rahul Singh Maharjan et al.

Although attention mechanisms have achieved considerable progress in Transformer-based architectures across various Artificial Intelligence (AI) domains, their inner workings remain to be explored. Existing explainable methods have different emphases but are rather one-sided. They primarily analyse the attention mechanisms or gradient-based attribution while neglecting the magnitudes of input feature values or the skip-connection module. Moreover, they inevitably bring spurious noisy pixel attributions unrelated to the model's decision, hindering humans' trust in the spotted visualization result. Hence, we propose an easy-to-implement but effective way to remedy this flaw: Smooth Noise Norm Attention (SNNA). We weigh the attention by the norm of the transformed value vector and guide the label-specific signal with the attention gradient, then randomly sample the input perturbations and average the corresponding gradients to produce noise-free attribution. Instead of evaluating the explanation method on the binary or multi-class classification tasks like in previous works, we explore the more complex multi-label classification scenario in this work, i.e., the driving action prediction task, and trained a model for it specifically. Both qualitative and quantitative evaluation results show the superiority of SNNA compared to other SOTA attention-based explainable methods in generating a clearer visual explanation map and ranking the input pixel importance.

LG Aug 10, 2025
Revisiting Data Attribution for Influence Functions

Hongbo Zhu, Angelo Cangelosi

The goal of data attribution is to trace the model's predictions through the learning algorithm and back to its training data, thereby identifying the most influential training samples and understanding how the model's behavior leads to particular predictions. Understanding how individual training examples influence a model's predictions is fundamental for machine learning interpretability, data debugging, and model accountability. Influence functions, originating from robust statistics, offer an efficient, first-order approximation to estimate the impact of marginally upweighting or removing a data point on a model's learned parameters and its subsequent predictions, without the need for expensive retraining. This paper comprehensively reviews the data attribution capability of influence functions in deep learning. We discuss their theoretical foundations, recent algorithmic advances for efficient inverse-Hessian-vector product estimation, and evaluate their effectiveness for data attribution and mislabel detection. Finally, we highlight current challenges and promising directions for unleashing the huge potential of influence functions in large-scale, real-world deep learning scenarios.
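
As a reminder of what is being approximated, the commonly cited first-order form of the influence of upweighting a training point z on the loss at a test point is given below. The notation is the usual textbook one and is an assumption here; the paper may use different symbols.

```latex
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}
    H_{\hat{\theta}}^{-1}\,
    \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat{\theta})
```

Here the parameters are those of the trained model and H is the empirical Hessian of the training loss. The product of the inverse Hessian with a gradient is the inverse-Hessian-vector product whose efficient estimation the abstract refers to.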

AI Apr 14, 2025
Pay Attention to What and Where? Interpretable Feature Extractor in Vision-based Deep Reinforcement Learning

Tien Pham, Angelo Cangelosi

Current approaches in Explainable Deep Reinforcement Learning share a limitation: the attention mask is displaced relative to the objects in the visual input. This work addresses a spatial problem within traditional Convolutional Neural Networks (CNNs). We propose the Interpretable Feature Extractor (IFE) architecture, aimed at generating an accurate attention mask to illustrate both "what" and "where" the agent concentrates on in the spatial domain. Our design incorporates a Human-Understandable Encoding module to generate a fully interpretable attention mask, followed by an Agent-Friendly Encoding module to enhance the agent's learning efficiency. These two components together form the Interpretable Feature Extractor for vision-based deep reinforcement learning to enable the model's interpretability. The resulting attention mask is consistent, highly understandable by humans, accurate in spatial dimension, and effectively highlights important objects or locations in visual input. The Interpretable Feature Extractor is integrated into the Fast and Data-efficient Rainbow framework, and evaluated on 57 ATARI games to show the effectiveness of the proposed approach on Spatial Preservation, Interpretability, and Data-efficiency. Finally, we showcase the versatility of our approach by incorporating the IFE into the Asynchronous Advantage Actor-Critic Model.

AI Oct 7, 2025
The Safety Challenge of World Models for Embodied AI Agents: A Review

Lorenzo Baraldi, Zifan Zeng, Chongzhe Zhang et al.

The rapid progress in embodied artificial intelligence has highlighted the necessity for more advanced and integrated models that can perceive, interpret, and predict environmental dynamics. In this context, World Models (WMs) have been introduced to provide embodied agents with the abilities to anticipate future environmental states and fill in knowledge gaps, thereby enhancing agents' ability to plan and execute actions. However, when dealing with embodied agents it is fundamental to ensure that predictions are safe for both the agent and the environment. In this article, we conduct a comprehensive literature review of World Models in the domains of autonomous driving and robotics, with a specific focus on the safety implications of scene and control generation tasks. Our review is complemented by an empirical analysis, wherein we collect and examine predictions from state-of-the-art models, identify and categorize common faults (herein referred to as pathologies), and provide a quantitative evaluation of the results.

RO Sep 5, 2025
DeGuV: Depth-Guided Visual Reinforcement Learning for Generalization and Interpretability in Manipulation

Tien Pham, Xinyun Chi, Khang Nguyen et al.

Reinforcement learning (RL) agents can learn to solve complex tasks from visual inputs, but generalizing these learned skills to new environments remains a major challenge in RL applications, especially robotics. While data augmentation can improve generalization, it often compromises sample efficiency and training stability. This paper introduces DeGuV, an RL framework that enhances both generalization and sample efficiency. Specifically, we leverage a learnable masker network that produces a mask from the depth input, preserving only critical visual information while discarding irrelevant pixels. Through this, we ensure that our RL agents focus on essential features, improving robustness under data augmentation. In addition, we incorporate contrastive learning and stabilize Q-value estimation under augmentation to further enhance sample efficiency and training stability. We evaluate our proposed method on the RL-ViGen benchmark using the Franka Emika robot and demonstrate its effectiveness in zero-shot sim-to-real transfer. Our results show that DeGuV outperforms state-of-the-art methods in both generalization and sample efficiency while also improving interpretability by highlighting the most relevant regions in the visual input.

CV Aug 10, 2025
Representation Understanding via Activation Maximization

Hongbo Zhu, Angelo Cangelosi

Understanding internal feature representations of deep neural networks (DNNs) is a fundamental step toward model interpretability. Inspired by neuroscience methods that probe biological neurons using visual stimuli, recent deep learning studies have employed Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons. In this work, we propose a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Unlike prior efforts that predominantly focus on the last output-layer neurons in CNNs, we extend feature visualization to intermediate layers as well, offering deeper insights into the hierarchical structure of learned feature representations. Furthermore, we investigate how activation maximization can be leveraged to generate adversarial examples, revealing potential vulnerabilities and decision boundaries of DNNs. Our experiments demonstrate the effectiveness of our approach in both traditional CNNs and modern ViT, highlighting its generalizability and interpretive value.
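
The core loop of activation maximization, gradient ascent on the input rather than on the weights, can be sketched on a toy unit. Everything here is an assumption made for illustration: the "neuron" is a linear response with an L2 penalty on the input, and the names `activation` and `maximize_activation` are hypothetical. Real use would differentiate through a CNN or ViT with an autodiff library.

```python
def activation(x, w, lam):
    """Toy neuron: linear response minus an L2 penalty on the input."""
    return sum(wi * xi for wi, xi in zip(w, x)) - lam * sum(xi * xi for xi in x)

def maximize_activation(w, lam=0.5, lr=0.1, steps=200):
    """Gradient ascent on the input, starting from an all-zero input."""
    x = [0.0] * len(w)
    for _ in range(steps):
        # Analytic gradient of activation() with respect to x.
        grad = [wi - 2.0 * lam * xi for wi, xi in zip(w, x)]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x

w = [1.0, -2.0, 0.5]
x_star = maximize_activation(w)
```

For this toy objective the maximiser is w / (2 * lam), so with lam = 0.5 the synthesized input converges towards the weight vector itself, which is the sense in which the optimized input "visualizes" what the unit responds to.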

RO Jun 24, 2025
Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning

Federico Tavella, Amber Drinkwater, Angelo Cangelosi

Vision-Language Models (VLMs) have emerged as powerful tools for generating textual descriptions from visual data. While these models excel on web-scale datasets, their robustness to the domain shifts inherent in many real-world applications remains under-explored. This paper presents a systematic evaluation of VLM performance on a single-view object captioning task when faced with a controlled, physical domain shift. We compare captioning accuracy across two distinct object sets: a collection of multi-material, real-world tools and a set of single-material, 3D-printed items. The 3D-printed set introduces a significant domain shift in texture and material properties, challenging the models' generalization capabilities. Our quantitative results demonstrate that all tested VLMs show a marked performance degradation when describing the 3D-printed objects compared to the real-world tools. This underscores a critical limitation in the ability of current models to generalize beyond surface-level features and highlights the need for more robust architectures for real-world signal processing applications.

RO Apr 14, 2025
Joint Action Language Modelling for Transparent Policy Execution

Theodor Wulff, Rahul Singh Maharjan, Xinyun Chi et al.

An agent's intention often remains hidden behind the black-box nature of embodied policies. Communication using natural language statements that describe the next action can provide transparency towards the agent's behavior. We aim to insert transparent behavior directly into the learning process, by transforming the problem of policy learning into a language generation problem and combining it with traditional autoregressive modelling. The resulting model produces transparent natural language statements followed by tokens representing the specific actions to solve long-horizon tasks in the Language-Table environment. Following previous work, the model is able to learn to produce a policy represented by special discretized tokens in an autoregressive manner. We place special emphasis on investigating the relationship between predicting actions and producing high-quality language for a transparent agent. We find that in many cases both the quality of the action trajectory and the transparent statement increase when they are generated simultaneously.
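
Representing a policy with discretized tokens usually means binning each continuous action component into an integer id that a language model can emit. The sketch below assumes uniform binning over [-1, 1] with 256 bins; the range, the bin count, and the function names are hypothetical and not taken from the paper.

```python
def action_to_token(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous action component to a token id in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)  # the upper bound itself falls in the last bin

def token_to_action(token, low=-1.0, high=1.0, bins=256):
    """Map a token id back to the centre of its bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width

tokens = [action_to_token(v) for v in (0.3, -0.75)]
decoded = [token_to_action(t) for t in tokens]
```

Decoding returns the bin centre, so the round-trip error is at most half a bin width, here 2 / 256 / 2.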

CV Apr 9, 2025
Attributes-aware Visual Emotion Representation Learning

Rahul Singh Maharjan, Marta Romeo, Angelo Cangelosi

Visual emotion analysis or recognition has gained considerable attention due to the growing interest in understanding how images can convey rich semantics and evoke emotions in human perception. However, visual emotion analysis poses distinctive challenges compared to traditional vision tasks, especially due to the intricate relationship between general visual features and the different affective states they evoke, known as the affective gap. Researchers have used deep representation learning methods to address this challenge of extracting generalized features from entire images. However, most existing methods overlook the importance of specific emotional attributes such as brightness, colorfulness, scene understanding, and facial expressions. Through this paper, we introduce A4Net, a deep representation network to bridge the affective gap by leveraging four key attributes: brightness (Attribute 1), colorfulness (Attribute 2), scene context (Attribute 3), and facial expressions (Attribute 4). By fusing and jointly training all aspects of attribute recognition and visual emotion analysis, A4Net aims to provide a better insight into emotional content in images. Experimental results show the effectiveness of A4Net, showcasing competitive performance compared to state-of-the-art methods across diverse visual emotion datasets. Furthermore, visualizations of activation maps generated by A4Net offer insights into its ability to generalize across different visual emotion datasets.

AI Mar 24, 2025
Towards Responsible AI Music: an Investigation of Trustworthy Features for Creative Systems

Jacopo de Berardinis, Lorenzo Porcaro, Albert Meroño-Peñuela et al.

Generative AI is radically changing the creative arts, by fundamentally transforming the way we create and interact with cultural artefacts. While offering unprecedented opportunities for artistic expression and commercialisation, this technology also raises ethical, societal, and legal concerns. Key among these are the potential displacement of human creativity, copyright infringement stemming from vast training datasets, and the lack of transparency, explainability, and fairness mechanisms. As generative systems become pervasive in this domain, responsible design is crucial. Whilst previous work has tackled isolated aspects of generative systems (e.g., transparency, evaluation, data), we take a comprehensive approach, grounding these efforts within the Ethics Guidelines for Trustworthy Artificial Intelligence produced by the High-Level Expert Group on AI appointed by the European Commission - a framework for designing responsible AI systems across seven macro requirements. Focusing on generative music AI, we illustrate how these requirements can be contextualised for the field, addressing trustworthiness across multiple dimensions and integrating insights from the existing literature. We further propose a roadmap for operationalising these contextualised requirements, emphasising interdisciplinary collaboration and stakeholder engagement. Our work provides a foundation for designing and evaluating responsible music generation systems, calling for collaboration among AI experts, ethicists, legal scholars, and artists. This manuscript is accompanied by a website: https://amresearchlab.github.io/raim-framework/.

AI Jun 14, 2024
Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation

Federico Tavella, Aphrodite Galata, Angelo Cangelosi

Artificial agents, particularly humanoid robots, interact with their environment, objects, and people using cameras, actuators, and physical presence. Their communication methods are often pre-programmed, limiting their actions and interactions. Our research explores acquiring non-verbal communication skills through learning from demonstrations, with potential applications in sign language comprehension and expression. In particular, we focus on imitation learning for artificial agents, exemplified by teaching a simulated humanoid American Sign Language. We use computer vision and deep learning to extract information from videos, and reinforcement learning to enable the agent to replicate observed actions. Compared to other methods, our approach eliminates the need for additional hardware to acquire information. We demonstrate how the combination of these different techniques offers a viable way to learn sign language. Our methodology successfully teaches 5 different signs involving the upper body (i.e., arms and hands). This research paves the way for advanced communication skills in artificial agents.

RO Oct 14, 2021
Human-robot collaboration and machine learning: a systematic review of recent research

Francesco Semeraro, Alexander Griffiths, Angelo Cangelosi

Technological progress increasingly envisions the use of robots interacting with people in everyday life. Human-robot collaboration (HRC) is the approach that explores the interaction between a human and a robot, during the completion of a common objective, at the cognitive and physical level. In HRC works, a cognitive model is typically built, which collects inputs from the environment and from the user, then elaborates and translates these into information that can be used by the robot itself. Machine learning is a recent approach to build the cognitive model and behavioural block, with high potential in HRC. Consequently, this paper proposes a thorough literature review of the use of machine learning techniques in the context of human-robot collaboration. 45 key papers were selected and analysed, and a clustering of works based on the type of collaborative tasks, evaluation metrics and cognitive variables modelled is proposed. Then, a deep analysis of different families of machine learning algorithms and their properties, along with the sensing modalities used, is carried out. Among the observations, the importance of machine learning algorithms that incorporate time dependencies is outlined. The salient features of these works are then cross-analysed to show trends in HRC and give guidelines for future works, comparing them with other aspects of HRC that did not appear in the review.

CL Oct 1, 2021
Phonology Recognition in American Sign Language

Federico Tavella, Aphrodite Galata, Angelo Cangelosi

Inspired by recent developments in natural language processing, we propose a novel approach to sign language processing based on phonological properties validated by American Sign Language users. By taking advantage of datasets composed of phonological data and people speaking sign language, we use a pretrained deep model based on mesh reconstruction to extract the 3D coordinates of the signers' keypoints. Then, we train standard statistical and deep machine learning models in order to assign phonological classes to each temporal sequence of coordinates. Our paper introduces the idea of exploiting the phonological properties manually assigned by sign language users to classify videos of people performing signs by regressing a 3D mesh. We establish a new baseline for this problem based on the statistical distribution of 725 different signs. Our best-performing models achieve a micro-averaged F1-score of 58% for the major location class and 70% for the sign type using statistical and deep learning algorithms, compared to their corresponding baselines of 35% and 39%.
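For readers unfamiliar with the metric reported above, the following is a minimal sketch of a micro-averaged F1-score for single-label classification, together with a most-frequent-class baseline. The baseline rule is an assumption made for illustration (the abstract only says the baseline is based on the statistical distribution of the signs), and the function names and the labels "head", "torso" and "hand" are hypothetical.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp += 1
        else:
            fp += 1  # the predicted class receives a false positive
            fn += 1  # the true class receives a false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_class(y_train):
    """Assumed baseline: always predict the most frequent training label."""
    return Counter(y_train).most_common(1)[0][0]


# Hypothetical labels, for illustration only.
y_true = ["head", "head", "torso", "hand"]
y_pred = ["head", "torso", "torso", "hand"]
score = micro_f1(y_true, y_pred)
```

In the single-label case every misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 coincides with plain accuracy.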

RO May 7, 2021
The Challenges and Opportunities of Human-Centered AI for Trustworthy Robots and Autonomous Systems

Hongmei He, John Gray, Angelo Cangelosi et al.

The trustworthiness of Robots and Autonomous Systems (RAS) has gained a prominent position on many research agendas towards fully autonomous systems. This research systematically explores, for the first time, the key facets of human-centered AI (HAI) for trustworthy RAS. In this article, five key properties of a trustworthy RAS are first identified. RAS must be (i) safe in any uncertain and dynamic surrounding environments; (ii) secure, thus protecting itself from any cyber-threats; (iii) healthy with fault tolerance; (iv) trusted and easy to use to allow effective human-machine interaction (HMI), and (v) compliant with the law and ethical expectations. Then, the challenges in implementing trustworthy autonomous systems are analytically reviewed with respect to the five key properties, and the roles of AI technologies are explored to ensure the trustworthiness of RAS with respect to safety, security, health and HMI, while reflecting the requirements of ethics in the design of RAS. While applications of RAS have mainly focused on performance and productivity, the risks posed by advanced AI in RAS have not received sufficient scientific attention. Hence, a new acceptance model of RAS is provided, as a framework of requirements for human-centered AI and for implementing trustworthy RAS by design. This approach promotes human-level intelligence to augment human capacity while focusing on contributions to humanity.

RO Jan 26, 2021
When Would You Trust a Robot? A Study on Trust and Theory of Mind in Human-Robot Interactions

Wenxuan Mou, Martina Ruocco, Debora Zanatto et al.

Trust is a critical issue in Human Robot Interactions as it is the core of human desire to accept and use a non-human agent. Theory of Mind has been defined as the ability to understand the beliefs and intentions of others that may differ from one's own. Evidence in psychology and HRI suggests that trust and Theory of Mind are interconnected and interdependent concepts, as the decision to trust another agent must depend on our own representation of this entity's actions, beliefs and intentions. However, very few works take Theory of Mind of the robot into consideration while studying trust in HRI. In this paper, we investigated whether the exposure to the Theory of Mind abilities of a robot could affect humans' trust towards the robot. To this end, participants played a Price Game with a humanoid robot that was presented as having either low level Theory of Mind or high level Theory of Mind. Specifically, the participants were asked to accept the price evaluations on common objects presented by the robot. The willingness of the participants to change their own price judgement of the objects (i.e., accept the price the robot suggested) was used as the main measurement of the trust towards the robot. Our experimental results showed that robots possessing a high level of Theory of Mind abilities were trusted more than the robots presented with low level Theory of Mind skills.

IR Aug 26, 2020
At Your Service: Coffee Beans Recommendation From a Robot Assistant

Jacopo de Berardinis, Gabriella Pizzuto, Francesco Lanza et al.

With advances in the field of machine learning, specifically algorithms for recommendation systems, robot assistants are envisioned to become more present in the hospitality industry. Additionally, the COVID-19 pandemic has highlighted the need to have more service robots in our everyday lives, to minimise the risk of human-to-human transmission. One such example would be coffee shops, which have become intrinsic to our everyday lives. However, serving an excellent cup of coffee is not a trivial feat as a coffee blend typically comprises rich aromas, indulgent and unique flavours and a lingering aftertaste. Our work addresses this by proposing a computational model which recommends optimal coffee beans resulting from the user's preferences. Specifically, given a set of coffee bean properties (objective features), we apply different supervised learning techniques to predict coffee qualities (subjective features). We then consider an unsupervised learning method to analyse the relationship between coffee beans in the subjective feature space. Evaluated on a real coffee beans dataset based on digitised reviews, our results illustrate that the proposed computational model gives up to 92.7 percent recommendation accuracy for coffee beans prediction. From this, we propose how this computational model can be deployed on a service robot to reliably predict customers' coffee bean preferences, starting from the user inputting their coffee preferences to the robot recommending the coffee beans that best meet the user's likings.

LG Aug 5, 2020
A robot that counts like a child: a developmental model of counting and pointing

Leszek Pecyna, Angelo Cangelosi, Alessandro Di Nuovo

In this paper, a novel neuro-robotics model capable of counting real items is introduced. The model allows us to investigate the interaction between embodiment and numerical cognition. It is composed of a deep neural network capable of image processing and sequential task performance, and a robotic platform providing the embodiment - the iCub humanoid robot. The network is trained using images from the robot's cameras and proprioceptive signals from its joints. The trained model is able to count a set of items and at the same time point to them. We investigate the influence of pointing on the counting process and compare our results with those from studies with children. Several training approaches are presented in this paper; all of them use a pre-training routine that allows the network to gain the ability of pointing and number recitation (from 1 to 10) prior to counting training. The impact of the counted set size and distance to the objects is investigated. The obtained results on counting performance show similarities with those from human studies.

NE Jun 20, 2020
Towards a self-organizing pre-symbolic neural model representing sensorimotor primitives

Junpei Zhong, Angelo Cangelosi, Stefan Wermter

The acquisition of symbolic and linguistic representations of sensorimotor behavior is a cognitive process performed by an agent when it is executing and/or observing its own and others' actions. According to Piaget's theory of cognitive development, these representations develop during the sensorimotor stage and the pre-operational stage. We propose a model that relates the conceptualization of the higher-level information from visual stimuli to the development of ventral/dorsal visual streams. This model employs a neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model. We exemplify this model through a robot passively observing an object to learn its features and movements. During the learning process of observing sensorimotor primitives, i.e. observing a set of trajectories of arm movements and its oriented object features, the pre-symbolic representation is self-organized in the parametric units. These representational units act as bifurcation parameters, guiding the robot to recognize and predict various learned sensorimotor primitives. The pre-symbolic representation also accounts for the learning of sensorimotor primitives in a latent learning context.

CV Aug 20, 2019
Human activity recognition from skeleton poses

Frederico Belmonte Klein, Angelo Cangelosi

Human Action Recognition is an important task of Human Robot Interaction as cooperation between robots and humans requires that artificial agents recognise complex cues from the environment. A promising approach is using trained classifiers to recognise human actions through sequences of skeleton poses extracted from images or RGB-D data from a sensor. However, with many different data-sets focused on slightly different sets of actions and different algorithms, it is not clear which strategy produces the highest accuracy for indoor activities performed in a home environment. This work discussed, tested and compared classic algorithms, namely, support vector machines and k-nearest neighbours, to two similar hierarchical neural gas approaches, the growing when required neural gas and the growing neural gas.
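As a rough illustration of one of the classic algorithms named above, here is a minimal k-nearest neighbours sketch over pose feature vectors. It assumes each skeleton pose sequence has already been flattened into a fixed-length tuple of numbers; the two-dimensional features and the labels "sit" and "walk" are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D pose features, for illustration only.
train = [
    ((0.0, 0.0), "sit"),
    ((0.0, 1.0), "sit"),
    ((5.0, 5.0), "walk"),
    ((6.0, 5.0), "walk"),
    ((5.0, 6.0), "walk"),
]
predicted = knn_predict(train, (0.2, 0.1), k=3)
```

With real skeleton data the feature vectors would be much longer, and the choice of k and of the distance measure would need to be tuned on held-out data.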

RO Aug 15, 2019
Sample-efficient Deep Reinforcement Learning with Imaginary Rollouts for Human-Robot Interaction

Mohammad Thabet, Massimiliano Patacchiola, Angelo Cangelosi

Deep reinforcement learning has proven to be a great success in allowing agents to learn complex tasks. However, its application to actual robots can be prohibitively expensive. Furthermore, the unpredictability of human behavior in human-robot interaction tasks can hinder convergence to a good policy. In this paper, we present an architecture that allows agents to learn models of stochastic environments and use them to accelerate learning. We describe how an environment model can be learned online and used to generate synthetic transitions, as well as how an agent can leverage these synthetic data to accelerate learning. We validate our approach using an experiment in which a robotic arm has to complete a task composed of a series of actions based on human gestures. Results show that our approach leads to significantly faster learning, requiring much less interaction with the environment. Furthermore, we demonstrate how learned models can be used by a robot to produce optimal plans in real world applications.
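The idea of learning an environment model online and replaying synthetic transitions can be sketched with classic tabular Dyna-style Q-learning. This is a deliberately simplified stand-in and not the deep architecture the abstract refers to: the lookup-table model, the class and function names, and the two-step toy task are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class TabularModel:
    """Empirical model of a stochastic environment: stores observed outcomes."""

    def __init__(self):
        self.outcomes = defaultdict(list)

    def update(self, state, action, reward, next_state):
        self.outcomes[(state, action)].append((reward, next_state))

    def sample(self, rng):
        # Pick a previously visited (state, action) pair and one recorded outcome.
        state, action = rng.choice(list(self.outcomes))
        reward, next_state = rng.choice(self.outcomes[(state, action)])
        return state, action, reward, next_state


def q_update(q, actions, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard one-step Q-learning update."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def learn_with_imagination(real_transitions, actions, n_imagined=5, seed=0):
    """Interleave each real transition with n_imagined model-generated ones."""
    rng = random.Random(seed)
    q = defaultdict(float)
    model = TabularModel()
    for s, a, r, s2 in real_transitions:
        q_update(q, actions, s, a, r, s2)  # learn from the real step
        model.update(s, a, r, s2)          # refine the model online
        for _ in range(n_imagined):        # replay synthetic transitions
            q_update(q, actions, *model.sample(rng))
    return q


# Hypothetical two-step task: reward arrives only on the second step.
steps = [("s0", "go", 0.0, "s1"), ("s1", "go", 1.0, "end")]
q_plain = learn_with_imagination(steps, ["go"], n_imagined=0)
q_imagined = learn_with_imagination(steps, ["go"], n_imagined=50)
```

Without imagined transitions the value of the first state stays at zero after a single pass, because the reward is only seen after that state was updated. Replaying model-generated transitions propagates the reward backwards without further interaction with the environment.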

CV Jul 9, 2019
Influence of Pointing on Learning to Count: A Neuro-Robotics Model

Leszek Pecyna, Angelo Cangelosi

In this paper, a neuro-robotics model capable of counting using gestures is introduced. The contribution of gestures to learning to count is tested with various model and training conditions. Two studies are presented in this article. In the first, we combine different modalities of the robot's neural network; in the second, a novel training procedure for it is proposed. The model is trained with pointing data from an iCub robot simulator. The behaviour of the model is in line with that of human children in terms of performance change depending on gesture production.

CV Jul 9, 2019
A Deep Neural Network for Finger Counting and Numerosity Estimation

Leszek Pecyna, Angelo Cangelosi, Alessandro Di Nuovo

In this paper, we present neuro-robotics models with a deep artificial neural network capable of generating finger counting positions and number estimation. We first train the model in an unsupervised manner where each layer is treated as a Restricted Boltzmann Machine or an autoencoder. Such a model is further trained in a supervised way. This type of pre-training is tested on our baseline model and two methods of pre-training are compared. The network is extended to produce finger counting positions. The performance in number estimation of such an extended model is evaluated. We test the hypothesis that the subitizing process can be obtained by one single model that is also used for the estimation of higher numerosities. The results confirm the importance of unsupervised training in our enumeration task and show some similarities to human behaviour in the case of subitizing.

AI Apr 17, 2018
Encoding Longer-term Contextual Multi-modal Information in a Predictive Coding Model

Junpei Zhong, Tetsuya Ogata, Angelo Cangelosi

Studies suggest that within the hierarchical architecture, the topological higher level possibly represents a conscious category of the current sensory events with slower changing activities. They attempt to predict the activities on the lower level by relaying the predicted information. On the other hand, the incoming sensory information corrects such prediction of the events on the higher level by the novel or surprising signal. We propose a predictive hierarchical artificial neural network model that examines this hypothesis on neurorobotic platforms, based on the AFA-PredNet model. In this neural network model, different temporal scales of prediction exist on different levels of the hierarchical predictive coding, which are defined in the temporal parameters in the neurons. Also, both the fast and the slow-changing neural activities are modulated by the active motor activities. A neurorobotic experiment based on the architecture was also conducted using data collected from the VRep simulator.

RO Apr 11, 2018
AFA-PredNet: The action modulation within predictive coding

Junpei Zhong, Angelo Cangelosi, Xinzheng Zhang et al.

The predictive processing (PP) framework hypothesizes that the predictive inference of our sensorimotor system is encoded implicitly in the regularities between perception and action. We propose a neural architecture in which such regularities of active inference are encoded hierarchically. We further suggest that this encoding emerges during the embodied learning process when the appropriate action is selected to minimize the prediction error in perception. Therefore, this predictive stream in the sensorimotor loop is generated in a top-down manner. Specifically, it is constantly modulated by the motor actions and is updated by the bottom-up prediction error signals. In this way, the top-down prediction originally comes from the prior experience from both perception and action representing the higher levels of this hierarchical cognition. In our proposed embodied model, we extend the PredNet Network, a hierarchical predictive coding network, with the motor action units implemented by a multi-layer perceptron network (MLP) to modulate the network top-down prediction. Two experiments, a minimalistic world experiment and a mobile robot experiment, are conducted to evaluate the proposed model in a qualitative way. In the neural representation, the causal inference of predictive percept from motor actions can also be observed while the agent is interacting with the environment.

CV Sep 12, 2017
Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers

Luca Surace, Massimiliano Patacchiola, Elena Battini Sönmez et al.

Group emotion recognition in the wild is a challenging problem, due to the unstructured environments in which everyday life pictures are taken. Some of the obstacles for an effective classification are occlusions, variable lighting conditions, and image quality. In this work we present a solution based on a novel combination of deep neural networks and Bayesian classifiers. The neural network works on a bottom-up approach, analyzing emotions expressed by isolated faces. The Bayesian classifier estimates a global emotion integrating top-down features obtained through a scene descriptor. In order to validate the system we tested the framework on the dataset released for the Emotion Recognition in the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test set, significantly outperforming the 53.62% competition baseline.

AI Sep 11, 2017
Autonomous Quadrotor Landing using Deep Reinforcement Learning

Riccardo Polvara, Massimiliano Patacchiola, Sanjay Sharma et al.

Landing an unmanned aerial vehicle (UAV) on a ground marker is an open problem despite the effort of the research community. Previous attempts mostly focused on the analysis of hand-crafted geometric features and the use of external sensors in order to allow the vehicle to approach the land-pad. In this article, we propose a method based on deep reinforcement learning that only requires low-resolution images taken from a down-looking camera in order to identify the position of the marker and land the UAV on it. The proposed approach is based on a hierarchy of Deep Q-Networks (DQNs) used as high-level control policy for the navigation toward the marker. We implemented different technical solutions, such as the combination of vanilla and double DQNs, and a partitioned buffer replay. Using domain randomization we trained the vehicle on uniform textures and we tested it on a large variety of simulated and real-world environments. The overall performance is comparable with a state-of-the-art algorithm and human pilots.
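The abstract mentions combining vanilla and double DQNs without saying how they are arranged within the hierarchy. The sketch below only contrasts the two standard bootstrap targets; the function names and the Q-values are hypothetical.

```python
def vanilla_dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for three actions in the next state.
next_q_online = [0.2, 0.9, 0.5]
next_q_target = [1.0, 0.4, 2.0]
vanilla = vanilla_dqn_target(1.0, next_q_target, gamma=0.5)
double = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.5)
```

In this example the vanilla target bootstraps from the largest target-network value (2.0), while the double target uses the target-network value of the action preferred by the online network (0.4). Decoupling selection from evaluation in this way is the usual motivation for double DQN, since it reduces overestimation of action values.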

NE Jun 8, 2017
Where is my forearm? Clustering of body parts from simultaneous tactile and linguistic input using sequential mapping

Karla Stepanova, Matej Hoffmann, Zdenek Straka et al.

Humans and animals are constantly exposed to a continuous stream of sensory information from different modalities. At the same time, they form more compressed representations like concepts or symbols. In species that use language, this process is further structured by this interaction, where a mapping between the sensorimotor concepts and linguistic elements needs to be established. There is evidence that children might be learning language by simply disambiguating potential meanings based on multiple exposures to utterances in different contexts (cross-situational learning). In existing models, the mapping between modalities is usually found in a single step by directly using frequencies of referent and meaning co-occurrences. In this paper, we present an extension of this one-step mapping and introduce a newly proposed sequential mapping algorithm together with a publicly available Matlab implementation. For demonstration, we have chosen a less typical scenario: instead of learning to associate objects with their names, we focus on body representations. A humanoid robot is receiving tactile stimulations on its body, while at the same time listening to utterances of the body part names (e.g., hand, forearm and torso). With the goal of arriving at the correct "body categories", we demonstrate how a sequential mapping algorithm outperforms one-step mapping. In addition, the effects of data set size and noise in the linguistic input are studied.

NE Feb 7, 2017
Toward Abstraction from Multi-modal Data: Empirical Studies on Multiple Time-scale Recurrent Models

Junpei Zhong, Angelo Cangelosi, Tetsuya Ogata

The abstraction tasks are challenging for multi-modal sequences as they require a deeper semantic understanding and a novel text generation for the data. Although recurrent neural networks (RNN) can be used to model the context of the time-sequences, in most cases the long-term dependencies of multi-modal data make the gradients of back-propagation through time training of RNN tend to vanish in the time domain. Recently, inspired by the Multiple Time-scale Recurrent Neural Network (MTRNN), an extension of the Gated Recurrent Unit (GRU), called the Multiple Time-scale Gated Recurrent Unit (MTGRU), has been proposed to learn the long-term dependencies in natural language processing. In particular, it is also able to accomplish the abstraction task for paragraphs given that the time constants are well defined. In this paper, we compare the MTRNN and MTGRU in terms of their learning performances as well as their abstraction representation on the higher level (with a slower neural activation). This was done by conducting two studies based on a smaller data-set (two-dimension time sequences from non-linear functions) and a relatively large data-set (43-dimension time sequences from iCub manipulation tasks with multi-modal data). We conclude that gated recurrent mechanisms may be necessary for learning long-term dependencies in large dimension multi-modal data-sets (e.g. learning of robot manipulation), even when natural language commands were not involved. But for smaller learning tasks with simple time-sequences, generic versions of recurrent models, such as MTRNN, were sufficient to accomplish the abstraction task.
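The phrase "slower neural activation" refers to units with larger time constants. A minimal sketch of the leaky-integrator update commonly used in MTRNN-style units is given below, assuming a scalar unit driven by a constant input; the function names and values are hypothetical, and the recurrent weights and non-linearity of a full network are omitted.

```python
def leaky_step(u, drive, tau):
    """One leaky-integrator update: a larger tau makes the state change more slowly."""
    return (1.0 - 1.0 / tau) * u + (1.0 / tau) * drive


def run(drives, tau):
    """Return the internal state after each input, starting from zero."""
    u, trace = 0.0, []
    for drive in drives:
        u = leaky_step(u, drive, tau)
        trace.append(u)
    return trace


# Hypothetical constant drive applied to a fast and a slow unit.
fast = run([1.0, 1.0, 1.0], tau=1.0)
slow = run([1.0, 1.0, 1.0], tau=2.0)
```

With tau = 1 the unit follows its input immediately, while with tau = 2 it only closes half of the remaining gap at each step. This is why the higher, slower level tends to capture more abstract, longer-term structure.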

RO May 11, 2016
A Hierarchical Emotion Regulated Sensorimotor Model: Case Studies

Junpei Zhong, Rony Novianto, Mingjun Dai et al.

Inspired by the hierarchical cognitive architecture and the perception-action model (PAM), we propose that the internal status acts as a kind of common-coding representation which affects, mediates and even regulates the sensorimotor behaviours. This regulation can be depicted in the Bayesian framework, which is why cognitive agents are able to generate behaviours with subtle differences according to their emotion or recognize the emotion by perception. A novel recurrent neural network called recurrent neural network with parametric bias units (RNNPB), which runs in three modes and constitutes a two-level emotion regulated learning model, was further applied to test this theory in two different cases.

RO May 11, 2016
Sensorimotor Input as a Language Generalisation Tool: A Neurorobotics Model for Generation and Generalisation of Noun-Verb Combinations with Sensorimotor Inputs

Junpei Zhong, Martin Peniak, Jun Tani et al.

The paper presents a neurorobotics cognitive model to explain the understanding and generalisation of noun and verb combinations when a vocal command consisting of a verb-noun sentence is provided to a humanoid robot. This generalisation process is done via the grounding process: different objects are interacted with, and associated with, different motor behaviours, following a learning approach inspired by developmental language acquisition in infants. This cognitive model is based on Multiple Time-scale Recurrent Neural Networks (MTRNN). With the data obtained from object manipulation tasks with a humanoid robot platform, the robotic agent implemented with this model can ground the primitive embodied structure of verbs through training with verb-noun combination samples. Moreover, we show that a functional hierarchical architecture, based on MTRNN, is able to generalise and produce novel combinations of noun-verb sentences. Further analyses of the learned network dynamics and representations also demonstrate how the generalisation is possible via the exploitation of this functional hierarchical recurrent network.

CL Jun 10, 2015
A cognitive neural architecture able to learn and communicate through natural language

Bruno Golosio, Angelo Cangelosi, Olesya Gamotina et al.

Communicative interactions involve a kind of procedural knowledge that is used by the human brain for processing verbal and nonverbal inputs and for language production. Although considerable work has been done on modeling human language abilities, it has been difficult to bring them together into a comprehensive tabula rasa system compatible with current knowledge of how verbal information is processed in the brain. This work presents a cognitive system, entirely based on a large-scale neural architecture, which was developed to shed light on the procedural knowledge involved in language elaboration. The main component of this system is the central executive, which is a supervising system that coordinates the other components of the working memory. In our model, the central executive is a neural network that takes as input the neural activation states of the short-term memory and yields as output mental actions, which control the flow of information among the working memory components through neural gating mechanisms. The proposed system is capable of learning to communicate through natural language starting from tabula rasa, without any a priori knowledge of the structure of phrases, meaning of words, or role of the different classes of words, only by interacting with a human through a text-based interface, using an open-ended incremental learning process. It is able to learn nouns, verbs, adjectives, pronouns and other word classes, and to use them in expressive language. The model was validated on a corpus of 1587 input sentences, based on literature on early language assessment, at the level of a child about 4 years old, and produced 521 output sentences, expressing a broad range of language processing functionalities.