Mihai Bâce

CV
h-index12
13papers
642citations
Novelty53%
AI Score47

13 Papers

11.4HCApr 18Code
Lightweight Cybersickness Detection based on User-Specific Eye and Head Tracking Data in Virtual Reality

Yijun Wang, Mihai Bâce, Maria Torres Vega

The occurrence of cybersickness in virtual reality (VR) significantly impairs users' perception and sense of immersion. Therefore, timely detection of cybersickness and the application of appropriate intervention strategies are crucial for enhancing the user experience. However, existing cybersickness detection methods often suffer from issues such as poor detection reliability across different levels of cybersickness and unnecessary model complexity. Furthermore, while cybersickness exhibits significant inter-user variability, most existing approaches aggregate all data from users and lack user-specific solutions. In this paper, we investigate a lightweight approach for cybersickness detection incorporating an ensemble learning model and user-specific eye and head tracking data. Our experiments using the open-source dataset Simulation 2021 demonstrate that feature engineering and training set construction are critical for determining detection performance. Models trained with data from similar-content segments achieve the best results, attaining detection accuracies of 93% in the cross-user setting and 88% in the user-personalized setting, using only 23-dimensional eye and head features. Moreover, by using user-specific data, well-tuned ensemble learning models with shorter training and inference times can be feasibly applied to real-world cybersickness detection, offering superior time efficiency and outstanding detection performance. This work offers useful evidence toward the development of lightweight and user-adaptive cybersickness detection models for VR applications.

CVAug 22, 2022
Neuro-Symbolic Visual Dialog

Adnen Abdessaied, Mihai Bâce, Andreas Bulling

We propose Neuro-Symbolic Visual Dialog (NSVD) -the first method to combine deep learning and symbolic program execution for multi-round visually-grounded reasoning. NSVD significantly outperforms existing purely-connectionist methods on two key challenges inherent to visual dialog: long-distance co-reference resolution as well as vanishing question-answering performance. We demonstrate the latter by proposing a more realistic and stricter evaluation scheme in which we use predicted answers for the full dialog history when calculating accuracy. We describe two variants of our model and show that using this new scheme, our best model achieves an accuracy of 99.72% on CLEVR-Dialog -a relative improvement of more than 10% over the state of the art while only requiring a fraction of training data. Moreover, we demonstrate that our neuro-symbolic models have a higher mean first failure round, are more robust against incomplete dialog histories, and generalise better not only to dialogs that are up to three times longer than those seen during training but also to unseen question types and scenes.

LGJun 20, 2023
Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning

Anna Penzkofer, Simon Schaefer, Florian Strohm et al.

While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or use human-annotated sub-goals that guide the learning process. We show that intentions of human players, i.e. the precursor of goal-oriented decisions, can be robustly predicted from eye gaze even for the long-horizon sparse rewards task of Montezuma's Revenge - one of the most challenging RL tasks in the Atari2600 game suite. We propose Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze. Our novel sub-goal extraction pipeline is fully automatic and replaces the need for manual sub-goal annotation by human experts. Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to a HRL agent that is significantly more sample efficient than previous methods.

CVMar 20, 2024
Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Florian Strohm, Mihai Bâce, Andreas Bulling

Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.

CVMar 26, 2024
DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

Chuhan Jiao, Yao Wang, Guanhua Zhang et al.

We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360° images based on a conditional score-based denoising diffusion model. Generating human gaze on 360° images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360° images as condition and uses two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360° image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences.

CVDec 19, 2025
Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation

Alexandre Personnic, Mihai Bâce

Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited by the feature representations within a frame but also between multiple frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing for the capture of an intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.

CVDec 9, 2024
HAIFAI: Human-AI Interaction for Mental Face Reconstruction

Florian Strohm, Mihai Bâce, Andreas Bulling

We present HAIFAI - a novel two-stage system where humans and AI interact to tackle the challenging task of reconstructing a visual representation of a face that exists only in a person's mind. In the first stage, users iteratively rank images our reconstruction system presents based on their resemblance to a mental image. These rankings, in turn, allow the system to extract relevant image features, fuse them into a unified feature vector, and use a generative model to produce an initial reconstruction of the mental image. The second stage leverages an existing face editing method, allowing users to manually refine and further improve this reconstruction using an easy-to-use slider interface for face shape manipulation. To avoid the need for tedious human data collection for training the reconstruction system, we introduce a computational user model of human ranking behaviour. For this, we collected a small face ranking dataset through an online crowd-sourcing study containing data from 275 participants. We evaluate HAIFAI and an ablated version in a 12-participant user study and demonstrate that our approach outperforms the previous state of the art regarding reconstruction quality, usability, perceived workload, and reconstruction speed. We further validate the reconstructions in a subsequent face ranking study with 18 participants and show that HAIFAI achieves a new state-of-the-art identification rate of 60.6%. These findings represent a significant advancement towards developing new interactive intelligent systems capable of reliably and effortlessly reconstructing a user's mental image.

CVMar 20, 2024
UP-FacE: User-predictable Fine-grained Face Shape Editing

Florian Strohm, Mihai Bâce, Andreas Bulling

We present User-predictable Face Editing (UP-FacE) -- a novel method for predictable face shape editing. In stark contrast to existing methods for face editing using trial and error, edits with UP-FacE are predictable by the human user. That is, users can control the desired degree of change precisely and deterministically and know upfront the amount of change required to achieve a certain editing result. Our method leverages facial landmarks to precisely measure facial feature values, facilitating the training of UP-FacE without manually annotated attribute labels. At the core of UP-FacE is a transformer-based network that takes as input a latent vector from a pre-trained generative model and a facial feature embedding, and predicts a suitable manipulation vector. To enable user-predictable editing, a scaling layer adjusts the manipulation vector to achieve the precise desired degree of change. To ensure that the desired feature is manipulated towards the target value without altering uncorrelated features, we further introduce a novel semantic face feature loss. Qualitative and quantitative results demonstrate that UP-FacE enables precise and fine-grained control over 23 face shape features.

HCDec 30, 2021
VisRecall: Quantifying Information Visualisation Recallability via Question Answering

Yao Wang, Chuhan Jiao, Mihai Bâce et al.

Despite its importance for assessing the effectiveness of communicating information visually, fine-grained recallability of information visualisations has not been studied quantitatively so far. In this work, we propose a question-answering paradigm to study visualisation recallability and present VisRecall - a novel dataset consisting of 200 visualisations that are annotated with crowd-sourced human (N = 305) recallability scores obtained from 1,000 questions of five question types. Furthermore, we present the first computational method to predict recallability of different visualisation elements, such as the title or specific data values. We report detailed analyses of our method on VisRecall and demonstrate that it outperforms several baselines in overall recallability and FE-, F-, RV-, and U-question recallability. Our work makes fundamental contributions towards a new generation of methods to assist designers in optimising visualisations.

CVDec 4, 2021
Scanpath Prediction on Information Visualisations

Yao Wang, Mihai Bâce, Andreas Bulling

We propose Unified Model of Saliency and Scanpaths (UMSS) -- a model that learns to predict visual saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while, overall, gaze patterns are surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods with respect to several, widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5% for scanpath prediction, and a relative improvement in Pearson correlation coefficient of up to 23.6% for saliency prediction. These results are auspicious and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.

CVAug 17, 2021
Neural Photofit: Gaze-based Mental Image Reconstruction

Florian Strohm, Ekta Sood, Sven Mayer et al.

We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: An encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer. A neural scoring network compares the human and neural attention and predicts a relevance score for each extracted image feature. Finally, image features are aggregated into a single feature vector as a linear combination of all features weighted by relevance which a decoder decodes into the final photofit. We train the neural scoring network on a novel dataset containing gaze data of 19 participants looking at collages of synthetic faces. We show that our method significantly outperforms a mean baseline predictor and report on a human study that shows that we can decode photofits that are visually plausible and close to the observer's mental image.

HCJul 25, 2019
Accurate and Robust Eye Contact Detection During Everyday Mobile Device Interactions

Mihai Bâce, Sander Staal, Andreas Bulling

Quantification of human attention is key to several tasks in mobile human-computer interaction (HCI), such as predicting user interruptibility, estimating noticeability of user interface content, or measuring user engagement. Previous works to study mobile attentive behaviour required special-purpose eye tracking equipment or constrained users' mobility. We propose a novel method to sense and analyse visual attention on mobile devices during everyday interactions. We demonstrate the capabilities of our method on the sample task of eye contact detection that has recently attracted increasing research interest in mobile HCI. Our method builds on a state-of-the-art method for unsupervised eye contact detection and extends it to address challenges specific to mobile interactive scenarios. Through evaluation on two current datasets, we demonstrate significant performance improvements for eye contact detection across mobile devices, users, or environmental conditions. Moreover, we discuss how our method enables the calculation of additional attention metrics that, for the first time, enable researchers from different domains to study and quantify attention allocation during mobile interactions in the wild.

HCJul 25, 2019
How far are we from quantifying visual attention in mobile HCI?

Mihai Bâce, Sander Staal, Andreas Bulling

With an ever-increasing number of mobile devices competing for our attention, quantifying when, how often, or for how long users visually attend to their devices has emerged as a core challenge in mobile human-computer interaction. Encouraged by recent advances in automatic eye contact detection using machine learning and device-integrated cameras, we provide a fundamental investigation into the feasibility of quantifying visual attention during everyday mobile interactions. We identify core challenges and sources of errors associated with sensing attention on mobile devices in the wild, including the impact of face and eye visibility, the importance of robust head pose estimation, and the need for accurate gaze estimation. Based on this analysis, we propose future research directions and discuss how eye contact detection represents the foundation for exciting new applications towards next-generation pervasive attentive user interfaces.