Michael J. Proulx

CV
h-index: 34
9 papers
38 citations
Novelty: 43%
AI Score: 41

9 Papers

84.9 · CV · May 30
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx et al.

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.

99.6 · HC · May 7
GazeMind: A Gaze-Guided LLM Agent for Personalized Cognitive Load Assessment

Bin Wang, Yue Liu, Benjamin Newman et al.

Smart glasses with AI assistants are increasingly used in daily life. However, current systems lack awareness of the user's internal cognitive state; without access to cognitive load, they cannot proactively anticipate users' needs. Existing methods for assessing cognitive load either rely on sensors that are impractical for lightweight eyewear or use eye gaze-based models that suffer from poor interpretability and require task-specific fine-tuning, often failing to generalize across individuals. We propose GazeMind, a gaze-guided LLM agent framework for cognitive load assessment on smart glasses. It encodes eye-tracking data into structured representations for LLM-based reasoning and provides interpretable cognitive load predictions. Importantly, GazeMind generalizes across scenarios without LLM fine-tuning through a novel task-guidance reasoning approach and achieves personalized adaptation by incorporating user-specific characteristics and historical references. To support evaluation, we introduce CogLoad-Bench, the largest gaze-based cognitive load dataset with 152 participants, 40+ hours of multimodal data, and 10K+ real-time annotations across controlled and real-world tasks. Experiments show that GazeMind achieves state-of-the-art performance, outperforming baselines by over 20% across all metrics.

LG · Jun 19, 2023
Performance of data-driven inner speech decoding with same-task EEG-fMRI data fusion and bimodal models

Holly Wilson, Scott Wellington, Foteini Simistira Liwicki et al.

Decoding inner speech from the brain signal via hybridisation of fMRI and EEG data is explored to investigate the performance benefits over unimodal models. Two different bimodal fusion approaches are examined: concatenation of probability vectors output from unimodal fMRI and EEG machine learning models, and data fusion with feature engineering. Same-task inner speech data are recorded from four participants, and different processing strategies are compared and contrasted with previously employed hybridisation methods. Data across participants are found to encode different underlying structures, which results in varying decoding performance between subject-dependent fusion models. Decoding performance is shown to improve with bimodal fMRI-EEG fusion strategies, provided the data show underlying structure.

CV · Apr 17, 2024
Establishing a Baseline for Gaze-driven Authentication Performance in VR: A Breadth-First Investigation on a Very Large Dataset

Dillon Lohr, Michael J. Proulx, Oleg Komogortsev

This paper establishes a baseline for gaze-driven authentication performance to begin answering fundamental research questions, using a very large dataset of gaze recordings from 9202 people with a level of eye tracking (ET) signal quality equivalent to modern consumer-facing virtual reality (VR) platforms. The size of the employed dataset is at least an order of magnitude larger than any other dataset from previous related work. Binocular estimates of the optical and visual axes of the eyes and a minimum duration for enrollment and verification are required for our model to achieve a false rejection rate (FRR) of below 3% at a false acceptance rate (FAR) of 1 in 50,000. In terms of identification accuracy, which decreases with gallery size, we estimate that our model would fall below chance-level accuracy for gallery sizes of 148,000 or more. Our major findings indicate that gaze authentication can be as accurate as required by the FIDO standard when driven by a state-of-the-art machine learning architecture and a sufficiently large training dataset.

HC · Jan 23, 2025
Eye Gaze as a Signal for Conveying User Attention in Contextual AI Systems

Ethan Wilson, Naveen Sendhilnathan, Charlie S. Burlingham et al.

Advanced multimodal AI agents can now collaborate with users to solve challenges in the world. Yet, these emerging contextual AI systems rely on explicit communication channels between the user and system. We hypothesize that implicit communication of the user's interests and intent would reduce friction and improve user experience when collaborating with AI agents. In this work, we explore the potential of wearable eye tracking to convey signals about user attention. We measure the eye tracking signal quality requirements to effectively map gaze traces to physical objects, then conduct experiments that provide visual scanpath history as additional context when querying vision language models. Our results show that eye tracking provides high value as a user attention signal and can convey important context about the user's current task and interests, improving understanding of contextual AI agents.

CV · May 30, 2025
Reading Recognition in the Wild

Charig Yang, Samiul Alam, Shakhrul Iman Siam et al.

To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.

CV · May 22, 2025
Ocular Authentication: Fusion of Gaze and Periocular Modalities

Dillon Lohr, Michael J. Proulx, Mehedi Hasan Raju et al.

This paper investigates the feasibility of fusing two eye-centric authentication modalities (eye movements and periocular images) within a calibration-free authentication system. While each modality has independently shown promise for user authentication, their combination within a unified gaze-estimation pipeline has not been thoroughly explored at scale. In this report, we propose a multimodal authentication system and evaluate it using a large-scale in-house dataset comprising 9202 subjects with an eye tracking (ET) signal quality equivalent to a consumer-facing virtual reality (VR) device. Our results show that the multimodal approach consistently outperforms both unimodal systems across all scenarios, surpassing the FIDO benchmark. The integration of a state-of-the-art machine learning architecture contributed significantly to the overall authentication performance at scale, driven by the model's ability to capture authentication representations and the complementary discriminative characteristics of the fused modalities.

HC · May 27, 2021
Electromagnetic actuation for a vibrotactile display: Assessing stimuli complexity and usability

Michael J. Proulx, Theodoros Eracleous, Ben Spencer et al.

Sensory substitution has influenced the design of many tactile visual substitution systems with the aim of offering visual aids for the blind. This paper focuses on whether a novel electromagnetic vibrotactile display, a four-by-four vibrotactile matrix of taxels, can serve as an aid for dynamic communication for visually impaired people. A mixed-methods approach was used first to assess whether pattern complexity affected undergraduate participants' perceptive success, and second to assess whether participants' total score correlated positively with their perceived success ratings. A thematic analysis was also conducted on participants' experiences with the vibrotactile display and the methods of interaction they used. The results indicated that complex patterns were perceived less accurately than simple and linear patterns, and no significant correlation was found between participants' scores and perceived success ratings. Additionally, most participants interacted with the vibrotactile display in similar ways, using one finger to feel one taxel at a time, arguably the most effective strategy according to previous research. This technology could have applications in navigational and communication aids for the visually impaired and for road users.

HC · Jan 14, 2021
Exploring Asymmetric Roles in Mixed-Ability Gaming

David Gonçalves, André Rodrigues, Mike L. Richardson et al.

The landscape of digital games is segregated by player ability. For example, sighted players have a multitude of highly visual games at their disposal, while blind players may choose from a variety of audio games. Attempts at improving cross-ability access to any of those are often limited in the experience they provide, or disregard multiplayer experiences. We explore ability-based asymmetric roles as a design approach to create engaging and challenging mixed-ability play. Our team designed and developed two collaborative testbed games exploring asymmetric interdependent roles. In a remote study with 13 mixed-visual-ability pairs we assessed how roles affected perceptions of engagement, competence, and autonomy, using a mixed-methods approach. The games provided an engaging and challenging experience, in which differences in visual ability were not limiting. Our results underline how experiences unequal by design can give rise to an equitable joint experience.