CV Aug 16, 2023
MultiMediate'23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions
Philipp Müller, Michal Balazia, Tobias Baur et al.
Automatic analysis of human behaviour is a fundamental prerequisite for the creation of machines that can effectively interact with and support humans in social interactions. In MultiMediate'23, we address two key human social behaviour analysis tasks for the first time in a controlled challenge: engagement estimation and bodily behaviour recognition in social interactions. This paper describes the MultiMediate'23 challenge and presents novel sets of annotations for both tasks. For engagement estimation, we collected novel annotations on the NOvice eXpert Interaction (NOXI) database. For bodily behaviour recognition, we annotated test recordings of the MPIIGroupInteraction corpus with the BBSI annotation scheme. In addition, we present baseline results for both challenge tasks.
LG Apr 30, 2022
Gaze-enhanced Crossmodal Embeddings for Emotion Recognition
Ahmed Abdou, Ekta Sood, Philipp Müller et al.
Emotional expressions are inherently multimodal -- integrating facial behavior, speech, and gaze -- but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, it did not include an explicit representation of gaze, despite the importance of gaze. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art for both audio-only and video-only emotion classification on the popular One-Minute Gradual Emotion Recognition dataset. Furthermore, we report extensive ablation experiments and provide detailed insights into the performance of different state-of-the-art gaze representations and integration strategies. Our results not only underline the importance of gaze for emotion recognition but also demonstrate a practical and highly effective approach to leveraging gaze information for this task.
CV Jul 26, 2022
Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation
Michal Balazia, Philipp Müller, Ákos Levente Tánczos et al.
Body language is an eye-catching social signal and its automatic analysis can significantly advance artificial intelligence systems to understand and actively participate in social interactions. While computer vision has made impressive progress in low-level tasks like head and body pose estimation, the detection of more subtle behaviors such as gesturing, grooming, or fumbling is not well explored. In this paper we present BBSI, the first set of annotations of complex Bodily Behaviors embedded in continuous Social Interactions in a group setting. Based on previous work in psychology, we manually annotated 26 hours of spontaneous human behavior in the MPIIGroupInteraction dataset with 15 distinct body language classes. We present comprehensive descriptive statistics on the resulting dataset as well as results of annotation quality evaluations. For automatic detection of these behaviors, we adapt the Pyramid Dilated Attention Network (PDAN), a state-of-the-art approach for human action detection. We perform experiments using four variants of spatial-temporal features as input to PDAN: Two-Stream Inflated 3D CNN, Temporal Segment Networks, Temporal Shift Module and Swin Transformer. Results are promising and indicate substantial room for improvement in this difficult task. Representing a key piece in the puzzle towards automatic understanding of social behavior, BBSI is fully available to the research community.
CV Dec 7, 2022
Multimodal Vision Transformers with Forced Attention for Behavior Analysis
Tanay Agrawal, Michal Balazia, Philipp Müller et al.
Human behavior understanding requires looking at minute details in the large context of a scene containing multiple input modalities. It is necessary as it allows the design of more human-like machines. While transformer approaches have shown great improvements, they face multiple challenges such as lack of data or background noise. To tackle these, we introduce the Forced Attention (FAt) Transformer, which utilizes forced attention with a modified backbone for input encoding and the use of additional inputs. In addition to improving the performance on different tasks and inputs, the modification requires less time and memory resources. We provide a model for generalised feature extraction for tasks concerning social signals and behavior analysis. Our focus is on understanding behavior in videos where people are interacting with each other or talking into the camera, which simulates the first-person point of view in social interaction. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets. We further provide an extensive ablation study of the proposed architecture.
CV Jun 2, 2023
Backchannel Detection and Agreement Estimation from Video with Transformer Networks
Ahmed Amer, Chirag Bhuvaneshwara, Gowtham K. Addluri et al.
Listeners use short interjections, so-called backchannels, to signify attention or express agreement. The automatic analysis of this behavior is of key importance for human conversation analysis and interactive conversational agents. Current state-of-the-art approaches for backchannel analysis from visual behavior make use of two types of features: features based on body pose and features based on facial behavior. At the same time, transformer neural networks have been established as an effective means to fuse input from different data sources, but they have not yet been applied to backchannel analysis. In this work, we conduct a comprehensive evaluation of multi-modal transformer architectures for automatic backchannel analysis based on pose and facial information. We address both the detection of backchannels as well as the task of estimating the agreement expressed in a backchannel. In evaluations on the MultiMediate'22 backchannel detection challenge, we reach 66.4% accuracy with a one-layer transformer architecture, outperforming the previous state of the art. With a two-layer transformer architecture, we furthermore set a new state of the art (0.0604 MSE) on the task of estimating the amount of agreement expressed in a backchannel.
CL Aug 8, 2024
Recognizing Emotion Regulation Strategies from Human Behavior with Large Language Models
Philipp Müller, Alexander Heimerl, Sayed Muddashir Hossain et al.
Human emotions are often not expressed directly, but regulated according to internal processes and social display rules. For affective computing systems, an understanding of how users regulate their emotions can be highly useful, for example to provide feedback in job interview training, or in psychotherapeutic scenarios. However, at present no method to automatically classify different emotion regulation strategies in a cross-user scenario exists. At the same time, recent studies showed that instruction-tuned Large Language Models (LLMs) can reach impressive performance across a variety of affect recognition tasks such as categorical emotion recognition or sentiment analysis. While these results are promising, it remains unclear to what extent the representational power of LLMs can be utilized in the more subtle task of classifying users' internal emotion regulation strategy. To close this gap, we make use of the recently introduced Deep corpus for modeling the social display of the emotion shame, where each point in time is annotated with one of seven different emotion regulation classes. We fine-tune Llama2-7B as well as the recently introduced Gemma model using Low-rank Optimization on prompts generated from different sources of information on the Deep corpus. These include verbal and nonverbal behavior, person factors, as well as the results of an in-depth interview after the interaction. Our results show that a fine-tuned Llama2-7B LLM is able to classify the utilized emotion regulation strategy with high accuracy (0.84) without needing access to data from post-interaction interviews. This represents a significant improvement over previous approaches based on Bayesian Networks and highlights the importance of modeling verbal behavior in emotion regulation.
LG Apr 17
(Weighted) Adaptive Radius Near Neighbor Search: Evaluation for WiFi Fingerprint-based Positioning
Khang Le, Joaquín Torres-Sospedra, Philipp Müller
Fixed Radius Near Neighbor (FRNN) search is an alternative to the widely used k Nearest Neighbors (kNN) search. Unlike kNN, FRNN determines a label or an estimate for a test sample based on all training samples within a predefined distance. While this approach is beneficial in certain scenarios, assuming a fixed maximum distance for all training samples can decrease the accuracy of the FRNN. Therefore, in this paper we propose the Adaptive Radius Near Neighbor (ARNN) and the Weighted ARNN (WARNN), which employ adaptive distances and, in the latter case, weights. All three methods are compared to kNN and twelve of its variants for a regression problem, namely WiFi fingerprinting indoor positioning, using 22 different datasets to provide a comprehensive analysis. While the performances of the tested FRNN and ARNN versions were among the worst, three of the four best methods in the test were WARNN versions, indicating that using weights together with adaptive distances achieves performance comparable to or even better than kNN variants.
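The difference between kNN and FRNN described in this abstract can be illustrated with a minimal sketch. It assumes Euclidean distance between fingerprints and an unweighted mean of the neighbors' positions; the function names and the toy data are hypothetical and not taken from the paper. ARNN and WARNN are not sketched because the abstract does not specify how the radius or the weights are chosen.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length fingerprint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_position(positions):
    """Component-wise mean of a non-empty list of position tuples."""
    n = len(positions)
    dims = len(positions[0])
    return tuple(sum(p[d] for p in positions) / n for d in range(dims))


def knn_estimate(fingerprints, positions, query, k):
    """kNN: average the positions of the k closest training samples."""
    order = sorted(range(len(fingerprints)),
                   key=lambda i: euclidean(fingerprints[i], query))
    return mean_position([positions[i] for i in order[:k]])


def frnn_estimate(fingerprints, positions, query, radius):
    """FRNN: average the positions of ALL training samples within `radius`.

    Returns None when no training sample lies inside the radius, which is
    one practical drawback of a fixed radius.
    """
    inside = [positions[i] for i, f in enumerate(fingerprints)
              if euclidean(f, query) <= radius]
    return mean_position(inside) if inside else None


# Toy data (assumed): two access-point signal strengths per fingerprint.
fingerprints = [(-50.0, -70.0), (-52.0, -68.0), (-80.0, -40.0)]
positions = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
query = (-51.0, -69.0)

print(frnn_estimate(fingerprints, positions, query, radius=5.0))
print(knn_estimate(fingerprints, positions, query, k=3))
```

With a radius of 5.0 only the two nearby fingerprints contribute, whereas kNN with k=3 is forced to include the distant third sample.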
SP May 14, 2025
Evaluation in EEG Emotion Recognition: State-of-the-Art Review and Unified Framework
Natia Kukhilava, Tatia Tsmindashvili, Rapael Kalandadze et al.
Electroencephalography-based Emotion Recognition (EEG-ER) has become a growing research area in recent years. Analyzing 216 papers published between 2018 and 2023, we uncover that the field lacks a unified evaluation protocol, which is essential to fairly define the state of the art, compare new approaches, and track the field's progress. We report the main inconsistencies between the used evaluation protocols, which are related to ground truth definition, evaluation metric selection, data splitting types (e.g., subject-dependent or subject-independent) and the use of different datasets. Capitalizing on this state-of-the-art research, we propose a unified evaluation protocol, EEGain (https://github.com/EmotionLab/EEGain), which enables an easy and efficient evaluation of new methods and datasets. EEGain is a novel open source software framework, offering the capability to compare - and thus define - state-of-the-art results. EEGain includes standardized methods for data pre-processing, data splitting, evaluation metrics, and the ability to load the six most relevant datasets (i.e., AMIGOS, DEAP, DREAMER, MAHNOB-HCI, SEED, SEED-IV) in EEG-ER with only a single line of code. In addition, we have assessed and validated EEGain using these six datasets on the four most common publicly available methods (EEGNet, DeepConvNet, ShallowConvNet, TSception). This is a significant step to make research on EEG-ER more reproducible and comparable, thereby accelerating the overall progress of the field.
LG Sep 14, 2021
HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO
Katharina Eggensperger, Philipp Müller, Neeratyoy Mallik et al.
To achieve peak predictive performance, hyperparameter optimization (HPO) is a crucial component of machine learning and its applications. Over the last years, the number of efficient algorithms and tools for HPO grew substantially. At the same time, the community is still lacking realistic, diverse, computationally cheap, and standardized benchmarks. This is especially the case for multi-fidelity HPO methods. To close this gap, we propose HPOBench, which includes 7 existing and 5 new benchmark families, with a total of more than 100 multi-fidelity benchmark problems. HPOBench makes it possible to run this extendable set of multi-fidelity HPO benchmarks in a reproducible way by isolating and packaging the individual benchmarks in containers. It also provides surrogate and tabular benchmarks for computationally affordable yet statistically sound evaluations. To demonstrate HPOBench's broad compatibility with various optimization tools, as well as its usefulness, we conduct an exemplary large-scale study evaluating 13 optimizers from 6 optimization tools. We provide HPOBench here: https://github.com/automl/HPOBench.
LG Apr 20, 2023
Flexible K Nearest Neighbors Classifier: Derivation and Application for Ion-mobility Spectrometry-based Indoor Localization
Philipp Müller
The K Nearest Neighbors (KNN) classifier is widely used in many fields such as fingerprint-based localization or medicine. It determines the class membership of an unlabelled sample based on the class memberships of the K labelled samples, the so-called nearest neighbors, that are closest to the unlabelled sample. The choice of K has been the topic of various studies and proposed KNN-variants. Yet no variant has been proven to outperform all other variants. In this paper a KNN-variant is discussed which ensures that the K nearest neighbors are indeed close to the unlabelled sample and finds K along the way. The algorithm is tested and compared to the standard KNN in theoretical scenarios and for indoor localization based on ion-mobility spectrometry fingerprints. It achieves a higher classification accuracy than the KNN in the tests, while having the same computational demand.
HC Aug 3, 2025
Implicit Search Intent Recognition using EEG and Eye Tracking: Novel Dataset and Cross-User Prediction
Mansi Sharma, Shuang Chen, Philipp Müller et al.
For machines to effectively assist humans in challenging visual search tasks, they must differentiate whether a human is simply glancing into a scene (navigational intent) or searching for a target object (informational intent). Previous research proposed combining electroencephalography (EEG) and eye-tracking measurements to recognize such search intents implicitly, i.e., without explicit user input. However, the applicability of these approaches to real-world scenarios suffers from two key limitations. First, previous work used fixed search times in the informational intent condition -- a stark contrast to visual search, which naturally terminates when the target is found. Second, methods incorporating EEG measurements addressed prediction scenarios that require ground truth training data from the target user, which is impractical in many use cases. We address these limitations by making the first publicly available EEG and eye-tracking dataset for navigational vs. informational intent recognition, where the user determines search times. We present the first method for cross-user prediction of search intents from EEG and eye-tracking recordings and reach 84.5% accuracy in leave-one-user-out evaluations -- comparable to within-user prediction accuracy (85.5%) but offering much greater flexibility.
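The leave-one-user-out protocol mentioned in this abstract can be sketched as follows. This is a generic illustration of the splitting scheme, not the authors' evaluation code; the function name and the example user IDs are hypothetical.

```python
def leave_one_user_out(user_ids):
    """Yield (held_out_user, train_indices, test_indices) for each user.

    Every fold tests on all samples of exactly one user and trains on the
    samples of all remaining users, so no data from the test user is seen
    during training.
    """
    for held_out in sorted(set(user_ids)):
        train = [i for i, u in enumerate(user_ids) if u != held_out]
        test = [i for i, u in enumerate(user_ids) if u == held_out]
        yield held_out, train, test


# Hypothetical per-sample user labels.
user_ids = ["a", "a", "b", "c", "c"]
folds = list(leave_one_user_out(user_ids))
for held_out, train, test in folds:
    print(held_out, train, test)
```

Accuracy is then typically averaged over the folds, which is what makes the reported figure a cross-user rather than a within-user estimate.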
CL Apr 4, 2024
M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews
Sayed Muddashir Hossain, Jan Alexandersson, Philipp Müller
Accurate utterance classification in motivational interviews is crucial to automatically understand the quality and dynamics of client-therapist interaction, and it can serve as a key input for systems mediating such interactions. Motivational interviews exhibit three important characteristics. First, there are two distinct roles, namely client and therapist. Second, they are often highly emotionally charged, which can be expressed both in text and in prosody. Finally, context is of central importance to classify any given utterance. Previous works did not adequately incorporate all of these characteristics into utterance classification approaches for mental health dialogues. In contrast, we present M3TCM, a Multi-modal, Multi-task Context Model for utterance classification. Our approach for the first time employs multi-task learning to effectively model both joint and individual components of therapist and client behaviour. Furthermore, M3TCM integrates information from the text and speech modality as well as the conversation context. With our novel approach, we outperform the state of the art for utterance classification on the recently introduced AnnoMI dataset, with relative improvements of 20% for client and 15% for therapist utterance classification. In extensive ablation studies, we quantify the improvement resulting from each contribution.
CV Aug 3, 2025
Distinguishing Target and Non-Target Fixations with EEG and Eye Tracking in Realistic Visual Scenes
Mansi Sharma, Camilo Andrés Martínez Martínez, Benedikt Emanuel Wirth et al.
Distinguishing target from non-target fixations during visual search is a fundamental building block to understand users' intended actions and to build effective assistance systems. While prior research indicated the feasibility of classifying target vs. non-target fixations based on eye tracking and electroencephalography (EEG) data, these studies were conducted with explicitly instructed search trajectories and abstract visual stimuli, and disregarded any scene context. This is in stark contrast with the fact that human visual search is largely driven by scene characteristics and raises questions regarding generalizability to more realistic scenarios. To close this gap, we, for the first time, investigate the classification of target vs. non-target fixations during free visual search in realistic scenes. In particular, we conducted a 36-participant user study using a large variety of 140 realistic visual search scenes in two highly relevant application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. Our approach based on gaze and EEG features outperforms the previous state-of-the-art approach based on a combination of fixation duration and saccade-related potentials. We perform extensive evaluations to assess the generalizability of our approach across scene types. Our approach significantly advances the ability to distinguish between target and non-target fixations in realistic scenarios, achieving 83.6% accuracy in cross-user evaluations. This substantially outperforms previous methods based on saccade-related potentials, which reached only 56.9% accuracy.
CV Jan 19
Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations
Tim Lachmann, Alexandra Israelsson, Christina Tornberg et al.
Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) blended emotion recognition that includes information on the relative salience of each emotion within a blend. BLEMORE comprises over 3,000 clips from 58 actors, performing 6 basic emotions and 10 distinct blends, where each blend has 3 different salience configurations (50/50, 70/30, and 30/70). Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy. On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2 + HuBERT) and 18% salience accuracy (HiCMAE). In sum, the BLEMORE dataset provides a valuable resource for advancing research on emotion recognition systems that account for the complexity and significance of blended emotion expressions.
CL Mar 27, 2025
AutoPsyC: Automatic Recognition of Psychodynamic Conflicts from Semi-structured Interviews with Large Language Models
Sayed Muddashir Hossain, Simon Ostermann, Patrick Gebhard et al.
Psychodynamic conflicts are persistent, often unconscious themes that shape a person's behaviour and experiences. Accurate diagnosis of psychodynamic conflicts is crucial for effective patient treatment and is commonly done via long, manually scored semi-structured interviews. Existing automated solutions for psychiatric diagnosis tend to focus on the recognition of broad disorder categories such as depression, and it is unclear to what extent psychodynamic conflicts, which even the patient themselves may not have conscious access to, could be automatically recognised from conversation. In this paper, we propose AutoPsyC, the first method for recognising the presence and significance of psychodynamic conflicts from full-length Operationalized Psychodynamic Diagnostics (OPD) interviews using Large Language Models (LLMs). Our approach combines recent advances in parameter-efficient fine-tuning and Retrieval-Augmented Generation (RAG) with a summarisation strategy to effectively process entire 90-minute conversations. In evaluations on a dataset of 141 diagnostic interviews, we show that AutoPsyC consistently outperforms all baselines and ablation conditions on the recognition of four highly relevant psychodynamic conflicts.
LG Jan 13, 2024
Classification of Volatile Organic Compounds by Differential Mobility Spectrometry Based on Continuity of Alpha Curves
Anton Rauhameri, Angelo Robiños, Osmo Anttalainen et al.
Background: Classification of volatile organic compounds (VOCs) is of interest in many fields. Examples include but are not limited to medicine, detection of explosives, and food quality control. Measurements collected with electronic noses can be used for classification and analysis of VOCs. One type of electronic nose that has seen considerable development in recent years is Differential Mobility Spectrometry (DMS). DMS yields measurements that are visualized as dispersion plots that contain traces, also known as alpha curves. Current methods used for analyzing DMS dispersion plots do not usually utilize the information stored in the continuity of these traces, which suggests that alternative approaches should be investigated. Results: In this work, for the first time, dispersion plots were interpreted as a series of measurements evolving sequentially. Thus, it was hypothesized that time-series classification algorithms can be effective for classification and analysis of dispersion plots. An extensive dataset of 900 dispersion plots for five chemicals measured at five flow rates and two concentrations was collected. The data was used to analyze the classification performance of six algorithms. The highest classification accuracy of 88% was achieved by a Long Short-Term Memory neural network, which supports our hypothesis. Significance: A new concept for approaching classification tasks of dispersion plots is presented and compared with other well-known classification algorithms. This creates a new angle of view for analysis and classification of the dispersion plots. In addition, a new dataset of dispersion plots is shared openly with the public.
CV Sep 27, 2021
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood, Fabian Kögel, Philipp Müller et al.
Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration - even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) - the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.
CV Aug 17, 2021
Neural Photofit: Gaze-based Mental Image Reconstruction
Florian Strohm, Ekta Sood, Sven Mayer et al.
We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: an encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer. A neural scoring network compares the human and neural attention and predicts a relevance score for each extracted image feature. Finally, image features are aggregated into a single feature vector as a linear combination of all features weighted by relevance, which a decoder decodes into the final photofit. We train the neural scoring network on a novel dataset containing gaze data of 19 participants looking at collages of synthetic faces. We show that our method significantly outperforms a mean baseline predictor and report on a human study that shows that we can decode photofits that are visually plausible and close to the observer's mental image.
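The aggregation step described in this abstract, a linear combination of per-face feature vectors weighted by relevance, can be sketched as below. The function name, the toy vectors, and the choice not to normalize the relevance scores are assumptions for illustration; the abstract does not state how the scores are scaled.

```python
def aggregate_features(features, relevance):
    """Combine feature vectors into one vector: sum_i relevance[i] * features[i].

    `features` is a list of equal-length vectors (one per viewed face) and
    `relevance` holds one score per vector. No normalization is applied here
    (an assumption; the source does not specify it).
    """
    if len(features) != len(relevance):
        raise ValueError("need exactly one relevance score per feature vector")
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(relevance, features))
            for d in range(dim)]


# Hypothetical two-dimensional features for two viewed faces.
combined = aggregate_features([[1.0, 0.0], [0.0, 2.0]], [0.5, 0.25])
print(combined)
```

In the described pipeline, a vector of this kind would then be passed to the decoder to produce the photofit.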
AI Sep 29, 2020
Neural Model-based Optimization with Right-Censored Observations
Katharina Eggensperger, Kai Haase, Philipp Müller et al.
In many fields of study, we only observe lower bounds on the true response value of some experiments. When fitting a regression model to predict the distribution of the outcomes, we cannot simply drop these right-censored observations, but need to properly model them. In this work, we focus on the concept of censored data in the light of model-based optimization where prematurely terminating evaluations (and thus generating right-censored data) is a key factor for efficiency, e.g., when searching for an algorithm configuration that minimizes runtime of the algorithm at hand. Neural networks (NNs) have been demonstrated to work well at the core of model-based optimization procedures and here we extend them to handle these censored observations. We propose (i) a loss function based on the Tobit model to incorporate censored samples into training and (ii) use an ensemble of networks to model the posterior distribution. To nevertheless be efficient in terms of optimization overhead, we propose to use Thompson sampling such that we only need to train a single NN in each iteration. Our experiments show that our trained regression models achieve a better predictive quality than several baselines and that our approach achieves new state-of-the-art performance for model-based optimization on two optimization problems: minimizing the solution time of a SAT solver and the time-to-accuracy of neural networks.
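A Tobit-style loss for right-censored data can be sketched as a per-sample negative log-likelihood. The version below assumes a Gaussian predictive distribution with mean `mu` and standard deviation `sigma`; it is the textbook form of the Tobit likelihood, and the exact loss used in the paper may differ. The function name is hypothetical.

```python
import math


def tobit_nll(y, mu, sigma, censored):
    """Negative log-likelihood of one observation under N(mu, sigma^2).

    Uncensored: the usual Gaussian density term.
    Right-censored: y is only a lower bound on the true value, so the
    likelihood is the survival function P(Y > y) = 1 - Phi((y - mu) / sigma).
    """
    z = (y - mu) / sigma
    if censored:
        # 1 - Phi(z) written with erfc for numerical stability.
        survival = 0.5 * math.erfc(z / math.sqrt(2.0))
        return -math.log(survival)
    return 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)


# A run stopped at its predicted mean: half the probability mass lies above.
print(tobit_nll(0.0, 0.0, 1.0, censored=True))
# The same value observed exactly.
print(tobit_nll(0.0, 0.0, 1.0, censored=False))
```

A censored sample is penalized only when the model predicts values below the observed bound, which is why such observations can be kept rather than dropped.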
LG Aug 16, 2019
BOAH: A Tool Suite for Multi-Fidelity Bayesian Optimization & Analysis of Hyperparameters
Marius Lindauer, Katharina Eggensperger, Matthias Feurer et al.
Hyperparameter optimization and neural architecture search can become prohibitively expensive for regular black-box Bayesian optimization because the training and evaluation of a single model can easily take several hours. To overcome this, we introduce a comprehensive tool suite for effective multi-fidelity Bayesian optimization and the analysis of its runs. The suite, written in Python, provides a simple way to specify complex design spaces, a robust and efficient combination of Bayesian optimization and HyperBand, and a comprehensive analysis of the optimization process and its outcomes.
HC May 6, 2019
Emergent Leadership Detection Across Datasets
Philipp Müller, Andreas Bulling
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing, but existing methods were evaluated on single datasets -- an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether, and to what extent, current methods for emergent leadership detection generalise to similar but new settings. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards emergent leadership detection in the real world.
HC Jan 18, 2018
Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors
Julian Steil, Philipp Müller, Yusuke Sugano et al.
Visual attention is highly fragmented during mobile interactions, but the erratic nature of attention shifts currently limits attentive user interfaces to adapting after the fact, i.e. after shifts have already happened. We instead study attention forecasting -- the challenging task of predicting users' gaze behaviour (overt visual attention) in the near future. We present a novel long-term dataset of everyday mobile phone interactions, continuously recorded from 20 participants engaged in common activities on a university campus over 4.5 hours each (more than 90 hours in total). We propose a proof-of-concept method that uses device-integrated sensors and body-worn cameras to encode rich information on device usage and users' visual scene. We demonstrate that our method can forecast bidirectional attention shifts and predict whether the primary attentional focus is on the handheld mobile device. We study the impact of different feature sets on performance and discuss the significant potential but also remaining challenges of forecasting user attention during mobile interactions.