CVDec 31, 2022
Skeletal Video Anomaly Detection using Deep Learning: Survey, Challenges and Future DirectionsPratik K. Mishra, Alex Mihailidis, Shehroz S. Khan · utoronto
The existing methods for video anomaly detection mostly utilize videos containing identifiable facial and appearance-based features. The use of videos with identifiable faces raises privacy concerns, especially when used in a hospital or community-based setting. Appearance-based features can also be sensitive to pixel-based noise, straining the anomaly detection methods to model the changes in the background and making it difficult to focus on the actions of humans in the foreground. Structural information in the form of skeletons describing the human motion in the videos is privacy-protecting and can overcome some of the problems posed by appearance-based features. In this paper, we present a survey of privacy-protecting deep learning anomaly detection methods using skeletons extracted from videos. We present a novel taxonomy of algorithms based on the various learning approaches. We conclude that skeleton-based approaches for anomaly detection can be a plausible privacy-protecting alternative for video anomaly detection. Lastly, we identify major open research questions and provide guidelines to address them.
CVDec 20, 2022
Privacy-Protecting Behaviours of Risk Detection in People with Dementia using VideosPratik K. Mishra, Andrea Iaboni, Bing Ye et al. · utoronto
People living with dementia often exhibit behavioural and psychological symptoms of dementia that can put their and others' safety at risk. Existing video surveillance systems in long-term care facilities can be used to monitor such behaviours of risk to alert the staff to prevent potential injuries or death in some cases. However, these behaviours of risk events are heterogeneous and infrequent in comparison to normal events. Moreover, analyzing raw videos can also raise privacy concerns. In this paper, we present two novel privacy-protecting video-based anomaly detection approaches to detect behaviours of risks in people with dementia. We either extracted body pose information as skeletons or used semantic segmentation masks to replace multiple humans in the scene with their semantic boundaries. Our work differs from most existing approaches for video anomaly detection that focus on appearance-based features, which can put the privacy of a person at risk and is also susceptible to pixel-based noise, including illumination and viewing direction. We used anonymized videos of normal activities to train customized spatio-temporal convolutional autoencoders and identify behaviours of risk as anomalies. We showed our results on a real-world study conducted in a dementia care unit with patients with dementia, containing approximately 21 hours of normal activities data for training and 9 hours of data containing normal and behaviours of risk events for testing. We compared our approaches with the original RGB videos and obtained a similar area under the receiver operating characteristic curve performance of 0.807 for the skeleton-based approach and 0.823 for the segmentation mask-based approach.
CVJun 25, 2022
Multi Visual Modality Fall Detection DatasetStefan Denkovski, Shehroz S. Khan, Brandon Malamis et al. · utoronto
Falls are one of the leading cause of injury-related deaths among the elderly worldwide. Effective detection of falls can reduce the risk of complications and injuries. Fall detection can be performed using wearable devices or ambient sensors; these methods may struggle with user compliance issues or false alarms. Video cameras provide a passive alternative; however, regular RGB cameras are impacted by changing lighting conditions and privacy concerns. From a machine learning perspective, developing an effective fall detection system is challenging because of the rarity and variability of falls. Many existing fall detection datasets lack important real-world considerations, such as varied lighting, continuous activities of daily living (ADLs), and camera placement. The lack of these considerations makes it difficult to develop predictive models that can operate effectively in the real world. To address these limitations, we introduce a novel multi-modality dataset (MUVIM) that contains four visual modalities: infra-red, depth, RGB and thermal cameras. These modalities offer benefits such as obfuscated facial features and improved performance in low-light conditions. We formulated fall detection as an anomaly detection problem, in which a customized spatio-temporal convolutional autoencoder was trained only on ADLs so that a fall would increase the reconstruction error. Our results showed that infra-red cameras provided the highest level of performance (AUC ROC=0.94), followed by thermal (AUC ROC=0.87), depth (AUC ROC=0.86) and RGB (AUC ROC=0.83). This research provides a unique opportunity to analyze the utility of camera modalities in detecting falls in a home setting while balancing performance, passiveness, and privacy.
CLSep 13, 2022
AI-powered Language Assessment Tools for DementiaMahboobeh Parsapoor, Muhammad Raisul Alam, Alex Mihailidis · utoronto
The main objective of this paper is to propose an approach for developing an Artificial Intelligence (AI)-powered Language Assessment (LA) tool. Such tools can be used to assess language impairments associated with dementia in older adults. The Machine Learning (ML) classifiers are the main parts of our proposed approach, therefore to develop an accurate tool with high sensitivity and specificity, we consider different binary classifiers and evaluate their performances. We also assess the reliability and validity of our approach by comparing the impact of different types of language tasks, features, and recording media on the performance of ML classifiers.
LGFeb 7, 2023
Undersampling and Cumulative Class Re-decision Methods to Improve Detection of Agitation in People with DementiaZhidong Meng, Andrea Iaboni, Bing Ye et al. · utoronto
Agitation is one of the most prevalent symptoms in people with dementia (PwD) that can place themselves and the caregiver's safety at risk. Developing objective agitation detection approaches is important to support health and safety of PwD living in a residential setting. In a previous study, we collected multimodal wearable sensor data from 17 participants for 600 days and developed machine learning models for detecting agitation in one-minute windows. However, there are significant limitations in the dataset, such as imbalance problem and potential imprecise labelsas the occurrence of agitation is much rarer in comparison to the normal behaviours. In this paper, we first implemented different undersampling methods to eliminate the imbalance problem, and came to the conclusion that only 20% of normal behaviour data were adequate to train a competitive agitation detection model. Then, we designed a weighted undersampling method to evaluate the manual labeling mechanism given the ambiguous time interval assumption. After that, the postprocessing method of cumulative class re-decision (CCR) was proposed based on the historical sequential information and continuity characteristic of agitation, improving the decision-making performance for the potential application of agitation detection system. The results showed that a combination of undersampling and CCR improved F1-score and other metrics to varying degrees with less training time and data.
CVAug 28, 2024
Depth-Weighted Detection of Behaviours of Risk in People with Dementia using CamerasPratik K. Mishra, Irene Ballester, Andrea Iaboni et al. · utoronto
The behavioural and psychological symptoms of dementia, such as agitation and aggression, present a significant health and safety risk in residential care settings. Many care facilities have video cameras in place for digital monitoring of public spaces, which can be leveraged to develop an automated behaviours of risk detection system that can alert the staff to enable timely intervention and prevent the situation from escalating. However, one of the challenges in our previous study was the presence of false alarms due to disparate importance of events based on distance. To address this issue, we proposed a novel depth-weighted loss to enforce equivalent importance to the events happening both near and far from the cameras; thus, helping to reduce false alarms. We further propose to utilize the training outliers to determine the anomaly threshold. The data from nine dementia participants across three cameras in a specialized dementia unit were used for training. The proposed approach obtained the best area under receiver operating characteristic curve performance of 0.852, 0.81 and 0.768, respectively, for the three cameras. Ablation analysis was conducted for the individual components of the proposed approach and effect of frame size and frame rate. The performance of the proposed approach was investigated for cross-camera, participant-specific and sex-specific behaviours of risk detection. The proposed approach performed reasonably well in reducing false alarms. This motivates further research to make the system more suitable for deployment in care facilities.
CVNov 6, 2023
Temporal Shift -- Multi-Objective Loss Function for Improved Anomaly Fall DetectionStefan Denkovski, Shehroz S. Khan, Alex Mihailidis
Falls are a major cause of injuries and deaths among older adults worldwide. Accurate fall detection can help reduce potential injuries and additional health complications. Different types of video modalities can be used in a home setting to detect falls, including RGB, Infrared, and Thermal cameras. Anomaly detection frameworks using autoencoders and their variants can be used for fall detection due to the data imbalance that arises from the rarity and diversity of falls. However, the use of reconstruction error in autoencoders can limit the application of networks' structures that propagate information. In this paper, we propose a new multi-objective loss function called Temporal Shift, which aims to predict both future and reconstructed frames within a window of sequential frames. The proposed loss function is evaluated on a semi-naturalistic fall detection dataset containing multiple camera modalities. The autoencoders were trained on normal activities of daily living (ADL) performed by older adults and tested on ADLs and falls performed by young adults. Temporal shift shows significant improvement to a baseline 3D Convolutional autoencoder, an attention U-Net CAE, and a multi-modal neural network. The greatest improvement was observed in an attention U-Net model improving by 0.20 AUC ROC for a single camera when compared to reconstruction alone. With significant improvement across different models, this approach has the potential to be widely adopted and improve anomaly detection capabilities in other settings besides fall detection.
CVOct 31, 2023
StairNet: Visual Recognition of Stairs for Human-Robot LocomotionAndrew Garrett Kurbis, Dmytro Kuzmenko, Bogdan Ivanyuk-Skulskiy et al.
Human-robot walking with prosthetic legs and exoskeletons, especially over complex terrains such as stairs, remains a significant challenge. Egocentric vision has the unique potential to detect the walking environment prior to physical interactions, which can improve transitions to and from stairs. This motivated us to create the StairNet initiative to support the development of new deep learning models for visual sensing and recognition of stairs, with an emphasis on lightweight and efficient neural networks for onboard real-time inference. In this study, we present an overview of the development of our large-scale dataset with over 515,000 manually labeled images, as well as our development of different deep learning models (e.g., 2D and 3D CNN, hybrid CNN and LSTM, and ViT networks) and training methods (e.g., supervised learning with temporal data and semi-supervised learning with unlabeled images) using our new dataset. We consistently achieved high classification accuracy (i.e., up to 98.8%) with different designs, offering trade-offs between model accuracy and size. When deployed on mobile devices with GPU and NPU accelerators, our deep learning models achieved inference speeds up to 2.8 ms. We also deployed our models on custom-designed CPU-powered smart glasses. However, limitations in the embedded hardware yielded slower inference speeds of 1.5 seconds, presenting a trade-off between human-centered design and performance. Overall, we showed that StairNet can be an effective platform to develop and study new visual perception systems for human-robot locomotion with applications in exoskeleton and prosthetic leg control.
12.6HCApr 21
Remindful: Designing Reminder Systems for Caregiver Interpretation in Dementia CareJoy Lai, Alex Mihailidis
Digital reminder systems are widely used in dementia care to support everyday tasks, but they are typically designed for one-way prompting rather than helping caregivers interpret engagement over time. We present Remindful, a caregiver-informed reminder platform that extends task prompting with caregiver-facing alerts, summaries, and review features to support awareness in home-based dementia care. Drawing on formative caregiver interviews, lived-experience advisor input, and in-home deployments with two caregiver-PLwD dyads, we examine how reminder-based caregiver awareness functions in practice. Our findings show that reminder systems can support caregiver reassurance, household coordination, and awareness of routines over time, but that reminder interaction data is highly context-dependent. Household participation, prompt attribution, routine mismatch, accessibility barriers, and technical failures all shaped what reminder logs could reasonably mean. We argue that reminder systems should not be treated as neutral behavioral sensors, but designed as assistive infrastructures for caregiver interpretation that preserve uncertainty and support contextual sensemaking in real homes.
CVJul 25, 2025
SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial ExpressionsBabak Taati, Muhammad Muzammil, Yasamin Zarghami et al. · utoronto
Accurate pain assessment in patients with limited ability to communicate, such as older adults with dementia, represents a critical healthcare challenge. Robust automated systems of pain detection may facilitate such assessments. Existing pain detection datasets, however, suffer from limited ethnic/racial diversity, privacy constraints, and underrepresentation of older adults who are the primary target population for clinical deployment. We present SynPAIN, a large-scale synthetic dataset containing 10,710 facial expression images (5,355 neutral/expressive pairs) across five ethnicities/races, two age groups (young: 20-35, old: 75+), and two genders. Using commercial generative AI tools, we created demographically balanced synthetic identities with clinically meaningful pain expressions. Our validation demonstrates that synthetic pain expressions exhibit expected pain patterns, scoring significantly higher than neutral and non-pain expressions using clinically validated pain assessment tools based on facial action unit analysis. We experimentally demonstrate SynPAIN's utility in identifying algorithmic bias in existing pain detection models. Through comprehensive bias evaluation, we reveal substantial performance disparities across demographic characteristics. These performance disparities were previously undetectable with smaller, less diverse datasets. Furthermore, we demonstrate that age-matched synthetic data augmentation improves pain detection performance on real clinical data, achieving a 7.0% improvement in average precision. SynPAIN addresses critical gaps in pain assessment research by providing the first publicly available, demographically diverse synthetic dataset specifically designed for older adult pain detection, while establishing a framework for measuring and mitigating algorithmic bias. The dataset is available at https://doi.org/10.5683/SP3/WCXMAP
AINov 20, 2025
PersonaDrift: A Benchmark for Temporal Anomaly Detection in Language-Based Dementia MonitoringJoy Lai, Alex Mihailidis
People living with dementia (PLwD) often show gradual shifts in how they communicate, becoming less expressive, more repetitive, or drifting off-topic in subtle ways. While caregivers may notice these changes informally, most computational tools are not designed to track such behavioral drift over time. This paper introduces PersonaDrift, a synthetic benchmark designed to evaluate machine learning and statistical methods for detecting progressive changes in daily communication, focusing on user responses to a digital reminder system. PersonaDrift simulates 60-day interaction logs for synthetic users modeled after real PLwD, based on interviews with caregivers. These caregiver-informed personas vary in tone, modality, and communication habits, enabling realistic diversity in behavior. The benchmark focuses on two forms of longitudinal change that caregivers highlighted as particularly salient: flattened sentiment (reduced emotional tone and verbosity) and off-topic replies (semantic drift). These changes are injected progressively at different rates to emulate naturalistic cognitive trajectories, and the framework is designed to be extensible to additional behaviors in future use cases. To explore this novel application space, we evaluate several anomaly detection approaches, unsupervised statistical methods (CUSUM, EWMA, One-Class SVM), sequence models using contextual embeddings (GRU + BERT), and supervised classifiers in both generalized and personalized settings. Preliminary results show that flattened sentiment can often be detected with simple statistical models in users with low baseline variability, while detecting semantic drift requires temporal modeling and personalized baselines. Across both tasks, personalized classifiers consistently outperform generalized ones, highlighting the importance of individual behavioral context.
CVMay 23, 2025
Dynamics of Affective States During Takeover Requests in Conditionally Automated Driving Among Older Adults with and without Cognitive ImpairmentGelareh Hajian, Ali Abedi, Bing Ye et al.
Driving is a key component of independence and quality of life for older adults. However, cognitive decline associated with conditions such as mild cognitive impairment and dementia can compromise driving safety and often lead to premature driving cessation. Conditionally automated vehicles, which require drivers to take over control when automation reaches its operational limits, offer a potential assistive solution. However, their effectiveness depends on the driver's ability to respond to takeover requests (TORs) in a timely and appropriate manner. Understanding emotional responses during TORs can provide insight into drivers' engagement, stress levels, and readiness to resume control, particularly in cognitively vulnerable populations. This study investigated affective responses, measured via facial expression analysis of valence and arousal, during TORs among cognitively healthy older adults and those with cognitive impairment. Facial affect data were analyzed across different road geometries and speeds to evaluate within- and between-group differences in affective states. Within-group comparisons using the Wilcoxon signed-rank test revealed significant changes in valence and arousal during TORs for both groups. Cognitively healthy individuals showed adaptive increases in arousal under higher-demand conditions, while those with cognitive impairment exhibited reduced arousal and more positive valence in several scenarios. Between-group comparisons using the Mann-Whitney U test indicated that cognitively impaired individuals displayed lower arousal and higher valence than controls across different TOR conditions. These findings suggest reduced emotional response and awareness in cognitively impaired drivers, highlighting the need for adaptive vehicle systems that detect affective states and support safe handovers for vulnerable users.
LGApr 15, 2021
Tracking agitation in people living with dementia in a care environmentShehroz S. Khan, Thaejaesh Sooriyakumaran, Katherine Rich et al.
Agitation is a symptom that communicates distress in people living with dementia (PwD), and that can place them and others at risk. In a long term care (LTC) environment, care staff track and document these symptoms as a way to detect when there has been a change in resident status to assess risk, and to monitor for response to interventions. However, this documentation can be time-consuming, and due to staffing constraints, episodes of agitation may go unobserved. This brings into question the reliability of these assessments, and presents an opportunity for technology to help track and monitor behavioural symptoms in dementia. In this paper, we present the outcomes of a 2 year real-world study performed in a dementia unit, where a multi-modal wearable device was worn by $20$ PwD. In line with a commonly used clinical documentation tool, this large multi-modal time-series data was analyzed to track the presence of episodes of agitation in 8-hour nursing shifts. The development of a baseline classification model (AUC=0.717) on this dataset and subsequent improvement (AUC= 0.779) lays the groundwork for automating the process of annotating agitation events in nursing charts.
LGMay 19, 2019
Spatio-Temporal Adversarial Learning for Detecting Unseen FallsShehroz S. Khan, Jacob Nogas, Alex Mihailidis
Fall detection is an important problem from both the health and machine learning perspective. A fall can lead to severe injuries, long term impairments or even death in some cases. In terms of machine learning, it presents a severely class imbalance problem with very few or no training data for falls owing to the fact that falls occur rarely. In this paper, we take an alternate philosophy to detect falls in the absence of their training data, by training the classifier on only the normal activities (that are available in abundance) and identifying a fall as an anomaly. To realize such a classifier, we use an adversarial learning framework, which comprises of a spatio-temporal autoencoder for reconstructing input video frames and a spatio-temporal convolution network to discriminate them against original video frames. 3D convolutions are used to learn spatial and temporal features from the input video frames. The adversarial learning of the spatio-temporal autoencoder will enable reconstructing the normal activities of daily living efficiently; thus, rendering detecting unseen falls plausible within this framework. We tested the performance of the proposed framework on camera sensing modalities that may preserve an individual's privacy (fully or partially), such as thermal and depth camera. Our results on three publicly available datasets show that the proposed spatio-temporal adversarial framework performed better than other baseline frame based (or spatial) adversarial learning methods.
CVMay 17, 2019
Limitations and Biases in Facial Landmark Detection -- An Empirical Study on Older Adults with DementiaAzin Asgarian, Shun Zhao, Ahmed B. Ashraf et al.
Accurate facial expression analysis is an essential step in various clinical applications that involve physical and mental health assessments of older adults (e.g. diagnosis of pain or depression). Although remarkable progress has been achieved toward developing robust facial landmark detection methods, state-of-the-art methods still face many challenges when encountering uncontrolled environments, different ranges of facial expressions, and different demographics of the population. A recent study has revealed that the health status of individuals can also affect the performance of facial landmark detection methods on front views of faces. In this work, we investigate this matter in a much greater context using seven facial landmark detection methods. We perform our evaluation not only on frontal faces but also on profile faces and in various regions of the face. Our results shed light on limitations of the existing methods and challenges of applying these methods in clinical settings by indicating: 1) a significant difference between the performance of state-of-the-art when tested on the profile or frontal faces of individuals with vs. without dementia; 2) insights on the existing bias for all regions of the face; and 3) the presence of this bias despite re-training/fine-tuning with various configurations of six datasets.
CVAug 30, 2018
DeepFall -- Non-invasive Fall Detection with Deep Spatio-Temporal Convolutional AutoencodersJacob Nogas, Shehroz S. Khan, Alex Mihailidis
Human falls rarely occur; however, detecting falls is very important from the health and safety perspective. Due to the rarity of falls, it is difficult to employ supervised classification techniques to detect them. Moreover, in these highly skewed situations it is also difficult to extract domain specific features to identify falls. In this paper, we present a novel framework, \textit{DeepFall}, which formulates the fall detection problem as an anomaly detection problem. The \textit{DeepFall} framework presents the novel use of deep spatio-temporal convolutional autoencoders to learn spatial and temporal features from normal activities using non-invasive sensing modalities. We also present a new anomaly scoring method that combines the reconstruction score of frames across a temporal window to detect unseen falls. We tested the \textit{DeepFall} framework on three publicly available datasets collected through non-invasive sensing modalities, thermal camera and depth cameras and show superior results in comparison to traditional autoencoder methods to identify unseen falls.
LGFeb 1, 2018
Bootstrapping and Multiple Imputation Ensemble Approaches for Missing DataShehroz S. Khan, Amir Ahmad, Alex Mihailidis
Presence of missing values in a dataset can adversely affect the performance of a classifier. Single and Multiple Imputation are normally performed to fill in the missing values. In this paper, we present several variants of combining single and multiple imputation with bootstrapping to create ensembles that can model uncertainty and diversity in the data, and that are robust to high missingness in the data. We present three ensemble strategies: bootstrapping on incomplete data followed by (i) single imputation and (ii) multiple imputation, and (iii) multiple imputation ensemble without bootstrapping. We perform an extensive evaluation of the performance of the these ensemble strategies on 8 datasets by varying the missingness ratio. Our results show that bootstrapping followed by multiple imputation using expectation maximization is the most robust method even at high missingness ratio (up to 30%). For small missingness ratio (up to 10%) most of the ensemble methods perform quivalently but better than single imputation. Kappa-error plots suggest that accurate classifiers with reasonable diversity is the reason for this behaviour. A consistent observation in all the datasets suggests that for small missingness (up to 10%), bootstrapping on incomplete data without any imputation produces equivalent results to other ensemble methods.
CVSep 6, 2014
Depth image hand tracking from an overhead perspective using partially labeled, unbalanced data: Development and real-world testingStephen Czarnuch, Alex Mihailidis
We present the development and evaluation of a hand tracking algorithm based on single depth images captured from an overhead perspective for use in the COACH prompting system. We train a random decision forest body part classifier using approximately 5,000 manually labeled, unbalanced, partially labeled training images. The classifier represents a random subset of pixels in each depth image with a learned probability density function across all trained body parts. A local mode-find approach is used to search for clusters present in the underlying feature space sampled by the classified pixels. In each frame, body part positions are chosen as the mode with the highest confidence. User hand positions are translated into hand washing task actions based on proximity to environmental objects. We validate the performance of the classifier and task action proposals on a large set of approximately 24,000 manually labeled images.
AIJun 25, 2012
Relational Approach to Knowledge Engineering for POMDP-based Assistance Systems as a Translation of a Psychological ModelMarek Grzes, Jesse Hoey, Shehroz Khan et al.
Assistive systems for persons with cognitive disabilities (e.g. dementia) are difficult to build due to the wide range of different approaches people can take to accomplishing the same task, and the significant uncertainties that arise from both the unpredictability of client's behaviours and from noise in sensor readings. Partially observable Markov decision process (POMDP) models have been used successfully as the reasoning engine behind such assistive systems for small multi-step tasks such as hand washing. POMDP models are a powerful, yet flexible framework for modelling assistance that can deal with uncertainty and utility. Unfortunately, POMDPs usually require a very labour intensive, manual procedure for their definition and construction. Our previous work has described a knowledge driven method for automatically generating POMDP activity recognition and context sensitive prompting systems for complex tasks. We call the resulting POMDP a SNAP (SyNdetic Assistance Process). The spreadsheet-like result of the analysis does not correspond to the POMDP model directly and the translation to a formal POMDP representation is required. To date, this translation had to be performed manually by a trained POMDP expert. In this paper, we formalise and automate this translation process using a probabilistic relational model (PRM) encoded in a relational database. We demonstrate the method by eliciting three assistance tasks from non-experts. We validate the resulting POMDP models using case-based simulations to show that they are reasonable for the domains. We also show a complete case study of a designer specifying one database, including an evaluation in a real-life experiment with a human actor.