CVJul 4, 2023Code
Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion RepresentationsAnil Osman Tur, Nicola Dall'Asen, Cigdem Beyan et al.
This paper aims to address the unsupervised video anomaly detection (VAD) problem, which involves classifying each frame in a video as normal or abnormal, without any access to labels. To accomplish this, the proposed method employs conditional diffusion models, where the input data is the spatiotemporal features extracted from a pre-trained network, and the condition is the features extracted from compact motion representations that summarize a given video segment in terms of its motion and appearance. Our method utilizes a data-driven threshold and considers a high reconstruction error as an indicator of anomalous events. This study is the first to utilize compact motion representations for VAD and the experiments conducted on two large-scale VAD benchmarks demonstrate that they supply relevant information to the diffusion model, and consequently improve VAD performances w.r.t the prior art. Importantly, our method exhibits better generalization performance across different datasets, notably outperforming both the state-of-the-art and baseline methods. The code of our method is available at https://github.com/AnilOsmanTur/conditioned_video_anomaly_diffusion
CVAug 23, 2022Code
Multimodal Across Domains Gaze Target DetectionFrancesco Tonini, Cigdem Beyan, Elisa Ricci
This paper addresses the gaze target detection problem in single images captured from the third-person perspective. We present a multimodal deep architecture to infer where a person in a scene is looking. This spatial model is trained on the head images of the person-of- interest, scene and depth maps representing rich context information. Our model, unlike several prior art, do not require supervision of the gaze angles, do not rely on head orientation information and/or location of the eyes of person-of-interest. Extensive experiments demonstrate the stronger performance of our method on multiple benchmark datasets. We also investigated several variations of our method by altering joint-learning of multimodal data. Some variations outperform a few prior art as well. First time in this paper, we inspect domain adaption for gaze target detection, and we empower our multimodal network to effectively handle the domain gap across datasets. The code of the proposed method is available at https://github.com/francescotonini/multimodal-across-domains-gaze-target-detection.
CVApr 21, 2022Code
Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints InvarianceGiancarlo Paoletti, Jacopo Cavazza, Cigdem Beyan et al.
This paper presents a novel end-to-end method for the problem of skeleton-based unsupervised human action recognition. We propose a new architecture with a convolutional autoencoder that uses graph Laplacian regularization to model the skeletal geometry across the temporal dynamics of actions. Our approach is robust towards viewpoint variations by including a self-supervised gradient reverse layer that ensures generalization across camera views. The proposed method is validated on NTU-60 and NTU-120 large-scale datasets in which it outperforms all prior unsupervised skeleton-based approaches on the cross-subject, cross-view, and cross-setup protocols. Although unsupervised, our learnable representation allows our method even to surpass a few supervised skeleton-based action recognition methods. The code is available in: www.github.com/IIT-PAVIS/UHAR_Skeletal_Laplacian
CVSep 26, 2024Code
Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target DetectionAndrea Toaiari, Vittorio Murino, Marco Cristani et al.
Gaze Target Detection (GTD), i.e., determining where a person is looking within a scene from an external viewpoint, is a challenging task, particularly in 3D space. Existing approaches heavily rely on analyzing the person's appearance, primarily focusing on their face to predict the gaze target. This paper presents a novel approach to tackle this problem by utilizing the person's upper-body pose and available depth maps to extract a 3D gaze direction and employing a multi-stage or an end-to-end pipeline to predict the gazed target. When predicted accurately, the human body pose can provide valuable information about the head pose, which is a good approximation of the gaze direction, as well as the position of the arms and hands, which are linked to the activity the person is performing and the objects they are likely focusing on. Consequently, in addition to performing gaze estimation in 3D, we are also able to perform GTD simultaneously. We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset without requiring images of the person's face, thus promoting privacy preservation in various application contexts. The code is available at https://github.com/intelligolabs/privacy-gtd-3D.
CVApr 12, 2023
Exploring Diffusion Models for Unsupervised Video Anomaly DetectionAnil Osman Tur, Nicola Dall'Asen, Cigdem Beyan et al.
This paper investigates the performance of diffusion models for video anomaly detection (VAD) within the most challenging but also the most operational scenario in which the data annotations are not used. As being sparse, diverse, contextual, and often ambiguous, detecting abnormal events precisely is a very ambitious task. To this end, we rely only on the information-rich spatio-temporal data, and the reconstruction power of the diffusion models such that a high reconstruction error is utilized to decide the abnormality. Experiments performed on two large-scale video anomaly detection datasets demonstrate the consistent improvement of the proposed method over the state-of-the-art generative models while in some cases our method achieves better scores than the more complex models. This is the first study using a diffusion model and examining its parameters' influence to present guidance for VAD in surveillance scenarios.
CVJul 18, 2023
Object-aware Gaze Target DetectionFrancesco Tonini, Nicola Dall'Asen, Cigdem Beyan et al.
Gaze target detection aims to predict the image location where the person is looking and the probability that a gaze is out of the scene. Several works have tackled this task by regressing a gaze heatmap centered on the gaze location, however, they overlooked decoding the relationship between the people and the gazed objects. This paper proposes a Transformer-based architecture that automatically detects objects (including heads) in the scene to build associations between every head and the gazed-head/object, resulting in a comprehensive, explainable gaze analysis composed of: gaze target area, gaze pixel point, the class and the image location of the gazed-object. Upon evaluation of the in-the-wild benchmarks, our method achieves state-of-the-art results on all metrics (up to 2.91% gain in AUC, 50% reduction in gaze distance, and 9% gain in out-of-frame average precision) for gaze target detection and 11-13% improvement in average precision for the classification and the localization of the gazed-objects. The code of the proposed method is publicly available.
CVAug 16, 2023
Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric VideosSanket Thakur, Cigdem Beyan, Pietro Morerio et al.
Objects are crucial for understanding human-object interactions. By identifying the relevant objects, one can also predict potential future interactions or actions that may occur with these objects. In this paper, we study the problem of Short-Term Object interaction anticipation (STA) and propose NAOGAT (Next-Active-Object Guided Anticipation Transformer), a multi-modal end-to-end transformer network, that attends to objects in observed frames in order to anticipate the next-active-object (NAO) and, eventually, to guide the model to predict context-aware future actions. The task is challenging since it requires anticipating future action along with the object with which the action occurs and the time after which the interaction will begin, a.k.a. the time to contact (TTC). Compared to existing video modeling architectures for action anticipation, NAOGAT captures the relationship between objects and the global scene context in order to predict detections for the next active object and anticipate relevant future actions given these detections, leveraging the objects' dynamics to improve accuracy. One of the key strengths of our approach, in fact, is its ability to exploit the motion dynamics of objects within a given clip, which is often ignored by other models, and separately decoding the object-centric and motion-centric information. Through our experiments, we show that our model outperforms existing methods on two separate datasets, Ego4D and EpicKitchens-100 ("Unseen Set"), as measured by several additional metrics, such as time to contact, and next-active-object localization. The code will be available upon acceptance.
CVFeb 13, 2023
Anticipating Next Active Objects for Egocentric VideosSanket Thakur, Cigdem Beyan, Pietro Morerio et al.
This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip where the contact might happen, before any action takes place. The problem is considerably hard, as we aim at estimating the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called ``time to contact'' (TTC) segment. Many methods have been proposed to anticipate the action of a person based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object, and its future location with respect to the first-person's motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+ and Ego4D. We also provide annotations for the first two datasets. Our approach performs best compared to relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods on varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.
77.7CVApr 15Code
Towards Unconstrained Human-Object InteractionFrancesco Tonini, Alessandro Conti, Lorenzo Vaquero et al.
Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi
67.9CVMar 11Code
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental AdjustmentFanqi Yu, Matteo Tiezzi, Tommaso Apicella et al.
We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
HCJul 20, 2022
Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A SurveyCigdem Beyan, Alessandro Vinciarelli, Alessio Del Bue
Automated co-located human-human interaction analysis has been addressed by the use of nonverbal communication as measurable evidence of social and psychological phenomena. We survey the computing studies (since 2010) detecting phenomena related to social traits (e.g., leadership, dominance, personality traits), social roles/relations, and interaction dynamics (e.g., group cohesion, engagement, rapport). Our target is to identify the nonverbal cues and computational methodologies resulting in effective performance. This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings (free-standing conversations, meetings, dyads, and crowds). We also present a comprehensive summary of the related datasets and outline future research directions which are regarding the implementation of artificial intelligence, dataset curation, and privacy-preserving interaction analysis. Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, and meetings composed of 3-4 persons equipped with microphones and cameras, respectively; multimodal features are prominently performing better; deep learning architectures showed improved performance in overall, but there exist many phenomena whose detection has never been implemented through deep models. We also identified several limitations such as the lack of scalable benchmarks, annotation reliability tests, cross-dataset experiments, and explainability analysis.
CVSep 27, 2024Code
AL-GTD: Deep Active Learning for Gaze Target DetectionFrancesco Tonini, Nicola Dall'Asen, Lorenzo Vaquero et al.
Gaze target detection aims at determining the image location where a person is looking. While existing studies have made significant progress in this area by regressing accurate gaze heatmaps, these achievements have largely relied on access to extensive labeled datasets, which demands substantial human labor. In this paper, our goal is to reduce the reliance on the size of labeled training data for gaze target detection. To achieve this, we propose AL-GTD, an innovative approach that integrates supervised and self-supervised losses within a novel sample acquisition function to perform active learning (AL). Additionally, it utilizes pseudo-labeling to mitigate distribution shifts during the training phase. AL-GTD achieves the best of all AUC results by utilizing only 40-50% of the training data, in contrast to state-of-the-art (SOTA) gaze target detectors requiring the entire training dataset to achieve the same performance. Importantly, AL-GTD quickly reaches satisfactory performance with 10-20% of the training data, showing the effectiveness of our acquisition function, which is able to acquire the most informative samples. We provide a comprehensive experimental analysis by adapting several AL methods for the task. AL-GTD outperforms AL competitors, simultaneously exhibiting superior performance compared to SOTA gaze target detectors when all are trained within a low-data regime. Code is available at https://github.com/francescotonini/al-gtd.
CVSep 23, 2024Code
Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language ModelsAnil Osman Tur, Alessandro Conti, Cigdem Beyan et al.
In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods. The zero-shot assumption is essential to avoid the need for re-training the classifier every time a new product is introduced into stock or an existing product undergoes rebranding. In this paper, we make three key contributions. Firstly, we introduce the MIMEX dataset, comprising 28 distinct product categories. Unlike existing datasets in the literature, MIMEX focuses on fine-grained product classification and includes a diverse range of retail products. Secondly, we benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset. Our experiments reveal that these models achieve unsatisfactory fine-grained classification performance, highlighting the need for specialized approaches. Lastly, we propose a novel ensemble approach that integrates embeddings from CLIP and DINOv2 with dimensionality reduction techniques to enhance classification performance. By combining these components, our ensemble approach outperforms VLMs, effectively capturing visual cues crucial for fine-grained product discrimination. Additionally, we introduce a class adaptation method that utilizes visual prototyping with limited samples in scenarios with scarce labeled data, addressing a critical need in retail environments where product variety frequently changes. To encourage further research into zero-shot object classification for smart retail applications, we will release both the MIMEX dataset and benchmark to the research community. Interested researchers can contact the authors for details on the terms and conditions of use. The code is available: https://github.com/AnilOsmanTur/Zero-shot-Retail-Product-Classification.
CVJul 23, 2022
Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive LossRiccardo Franceschini, Enrico Fini, Cigdem Beyan et al.
Emotion recognition is involved in several real-world applications. With an increase in available modalities, automatic understanding of emotions is being performed more accurately. The success in Multimodal Emotion Recognition (MER), primarily relies on the supervised learning paradigm. However, data annotation is expensive, time-consuming, and as emotion expression and perception depends on several factors (e.g., age, gender, culture) obtaining labels with a high reliability is hard. Motivated by these, we focus on unsupervised feature learning for MER. We consider discrete emotions, and as modalities text, audio and vision are used. Our method, as being based on contrastive loss between pairwise modalities, is the first attempt in MER literature. Our end-to-end feature learning approach has several differences (and advantages) compared to existing MER methods: i) it is unsupervised, so the learning is lack of data labelling cost; ii) it does not require data spatial augmentation, modality alignment, large number of batch size or epochs; iii) it applies data fusion only at inference; and iv) it does not require backbones pre-trained on emotion recognition task. The experiments on benchmark datasets show that our method outperforms several baseline approaches and unsupervised learning methods applied in MER. Particularly, it even surpasses a few supervised MER state-of-the-art.
19.9CVApr 26Code
Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose EstimationNavid Aslankhani Khameneh, Marco Carletti, Cigdem Beyan
Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose- LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust inbed pose estimation without modifying the sensing pipeline. The code is available at: github.com/navidTerraNova/ GeoDiffPose.
42.4CVApr 26Code
Discriminator-Guided Adaptive Diffusion for Source-Free Test-Time Adaptation under Image CorruptionsFrancesco Olivato, Cigdem Beyan, Vittorio Murino
In this work, we study Source-Free Unsupervised Domain Adaptation under corruption-induced domain shifts, where performance degradation is caused by natural image corruptions that go beyond additive noise, including blur, weather effects, and digital artifacts. We propose a diffusion-based, input-level adaptation framework that operates entirely at test time and keeps all source-trained models frozen, explicitly targeting robustness to corrupted target inputs. Our method leverages a source-trained diffusion model as a generative prior and introduces a discriminator-guided adaptive diffusion strategy that dynamically controls the amount of perturbation applied to each test sample. Rather than relying on a fixed diffusion depth, the discriminator determines, on a per-image basis, when sufficient forward diffusion has been applied to suppress corruption-specific artifacts, with each corruption type effectively defining a distinct target domain. This adaptive stopping mechanism applies only the necessary amount of noise to remove domainspecific corruption while preserving class-discriminative structure. The reverse diffusion process then reconstructs a source-aligned image, optionally stabilized through structural guidance, which is classified using a frozen source-trained classifier. We evaluate the proposed approach across a broad spectrum of corruption-induced target domains, covering 15 diverse corruption types, and demonstrate more balanced robustness with competitive or improved performance across non-noise corruptions. Additional analyses reveal how the adaptive diffusion schedule responds to different corruption characteristics, highlighting the practicality, generality, and robustness of the proposed framework. The code is publicly available at https://github.com/fmolivato/dgadiffusion/.
39.1CVApr 26Code
HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQAFrancesco Dibitonto, Cigdem Beyan, Vittorio Murino
Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC
CVJun 30, 2025Code
MadCLIP: Few-shot Medical Anomaly Detection with CLIPMahshid Shiri, Cigdem Beyan, Vittorio Murino
An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. The code is available at https://github.com/mahshid1998/MadCLIP.
CVJul 23, 2025Code
Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction DetectionFrancesco Tonini, Lorenzo Vaquero, Alessandro Conti et al.
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.
CVJul 9, 2025Code
MADPOT: Medical Anomaly Detection with CLIP Adaptation and Partial Optimal TransportMahshid Shiri, Cigdem Beyan, Vittorio Murino
Medical anomaly detection (AD) is challenging due to diverse imaging modalities, anatomical variations, and limited labeled data. We propose a novel approach combining visual adapters and prompt learning with Partial Optimal Transport (POT) and contrastive learning (CL) to improve CLIP's adaptability to medical images, particularly for AD. Unlike standard prompt learning, which often yields a single representation, our method employs multiple prompts aligned with local features via POT to capture subtle abnormalities. CL further enforces intra-class cohesion and inter-class separation. Our method achieves state-of-the-art results in few-shot, zero-shot, and cross-dataset scenarios without synthetic data or memory banks. The code is available at https://github.com/mahshid1998/MADPOT.
38.8LGMay 7
AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural NetworksYi Wei, Xuan Qi, Furao shen et al.
Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.
36.1LGMay 6
Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine NetworksXuan Qi, Yi Wei, Fanqi Yu et al.
Batch normalization (BN) is central to modern deep networks, but its effect on the realized function during training remains less understood than its optimization benefits. We study training-time BN in continuous piecewise-affine (CPA) networks through the geometry of switching hyperplanes and the induced affine-region partition. Conditioned on a mini-batch, we show that BN defines for each neuron a reference hyperplane through the batch centroid, and that breakpoint-switching hyperplanes are parallel translates whose offsets are expressed in batch-standardized coordinates and are independent of the raw bias. This yields an exact criterion for when a switching hyperplane intersects a local $\ell_\infty$ window and motivates a local region-density functional based on exact affine-region counts. Under explicit sufficient conditions, we show that BN increases expected local partition refinement in ReLU and more general piecewise-affine networks, and that this mechanism transfers locally through depth inside parent affine regions where the upstream representation map is an affine embedding. These results provide a function-level geometric account of training-time BN as a batch-conditional recentering mechanism near the data.
ROApr 11, 2024
Socially Pertinent Robots in Gerontological HealthcareXavier Alameda-Pineda, Angus Addlesee, Daniel Hernández García et al.
Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilities will be useful and accepted in real-life facilities is yet to be answered. This paper is an attempt to partially answer this question, via two waves of experiments with patients and companions in a day-care gerontological facility in Paris with a full-sized humanoid robot endowed with social and conversational interaction capabilities. The software architecture, developed during the H2020 SPRING project, together with the experimental protocol, allowed us to evaluate the acceptability (AES) and usability (SUS) with more than 60 end-users. Overall, the users are receptive to this technology, especially when the robot perception and action skills are robust to environmental clutter and flexible to handle a plethora of different interactions.
CVOct 18, 2024
CLIP-VAD: Exploiting Vision-Language Models for Voice Activity DetectionAndrea Appiani, Cigdem Beyan
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in an audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments composed of the upper body of an individual, while the text encoder handles textual descriptions automatically generated through prompt engineering. Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
CVMay 25, 2023
Guided Attention for Next Active Object @ EGO4D STA ChallengeSanket Thakur, Cigdem Beyan, Pietro Morerio et al.
In this technical report, we describe the Guided-Attention mechanism based solution for the short-term anticipation (STA) challenge for the EGO4D challenge. It combines the object detections, and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decoding the object-centric and motion-centric information to address the problem of STA in egocentric videos. For the challenge, we build our model on top of StillFast with Guided Attention applied on fast network. Our model obtains better performance on the validation set and also achieves state-of-the-art (SOTA) results on the challenge test set for EGO4D Short-Term Object Interaction Anticipation Challenge.
CVMay 22, 2023
Enhancing Next Active Object-based Egocentric Action Anticipation with Guided AttentionSanket Thakur, Cigdem Beyan, Pietro Morerio et al.
Short-term action anticipation (STA) in first-person videos is a challenging task that involves understanding the next active object interactions and predicting future actions. Existing action anticipation methods have primarily focused on utilizing features extracted from video clips, but often overlooked the importance of objects and their interactions. To this end, we propose a novel approach that applies a guided attention mechanism between the objects, and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decoding the object-centric and motion-centric information to address the problem of STA in egocentric videos. Our method, GANO (Guided Attention for Next active Objects) is a multi-modal, end-to-end, single transformer-based network. The experimental results performed on the largest egocentric dataset demonstrate that GANO outperforms the existing state-of-the-art methods for the prediction of the next active object label, its bounding box location, the corresponding future action, and the time to contact the object. The ablation study shows the positive contribution of the guided attention mechanism compared to other fusion methods. Moreover, it is possible to improve the next active object location and class label prediction results of GANO by just appending the learnable object tokens with the region of interest embeddings.
CVMay 6, 2021
Estimating Presentation Competence using Multimodal Nonverbal Behavioral CuesÖmer Sümer, Cigdem Beyan, Fabian Ruth et al.
Public speaking and presentation competence plays an essential role in many areas of social interaction in our educational, professional, and everyday life. Since our intention during a speech can differ from what is actually understood by the audience, the ability to appropriately convey our message requires a complex set of skills. Presentation competence is cultivated in the early school years and continuously developed over time. One approach that can promote efficient development of presentation competence is the automated analysis of human behavior during a speech based on visual and audio features and machine learning. Furthermore, this analysis can be used to suggest improvements and the development of skills related to presentation competence. In this work, we investigate the contribution of different nonverbal behavioral cues, namely, facial, body pose-based, and audio-related features, to estimate presentation competence. The analyses were performed on videos of 251 students while the automated assessment is based on manual ratings according to the Tübingen Instrument for Presentation Competence (TIP). Our classification results reached the best performance with early fusion in the same dataset evaluation (accuracy of 71.25%) and late fusion of speech, face, and body pose features in the cross dataset evaluation (accuracy of 78.11%). Similarly, regression results performed the best with fusion strategies.
CVJun 21, 2020
Subspace Clustering for Action Recognition with Covariance Representations and Temporal PruningGiancarlo Paoletti, Jacopo Cavazza, Cigdem Beyan et al.
This paper tackles the problem of human action recognition, defined as classifying which action is displayed in a trimmed sequence, from skeletal data. Albeit state-of-the-art approaches designed for this application are all supervised, in this paper we pursue a more challenging direction: Solving the problem with unsupervised learning. To this end, we propose a novel subspace clustering method, which exploits covariance matrix to enhance the action's discriminability and a timestamp pruning approach that allow us to better handle the temporal dimension of the data. Through a broad experimental validation, we show that our computational pipeline surpasses existing unsupervised approaches but also can result in favorable performances as compared to supervised methods.