Hamed Tabkhi

CV
h-index: 20
49 papers
719 citations
Novelty: 40%
AI Score: 55

49 Papers

CV · Dec 19, 2022 · Code
CHAD: Charlotte Anomaly Dataset

Armin Danesh Pazho, Ghazal Alinezhad Noghre, Babak Rahimi Ardabili et al.

In recent years, we have seen a significant interest in data-driven deep learning approaches for video anomaly detection, where an algorithm must determine if specific frames of a video contain abnormal behaviors. However, video anomaly detection is particularly context-specific, and the availability of representative datasets heavily limits real-world accuracy. Additionally, the metrics currently reported by most state-of-the-art methods often do not reflect how well the model will perform in real-world scenarios. In this article, we present the Charlotte Anomaly Dataset (CHAD). CHAD is a high-resolution, multi-camera anomaly dataset in a commercial parking lot setting. In addition to frame-level anomaly labels, CHAD is the first anomaly dataset to include bounding box, identity, and pose annotations for each actor. This is especially beneficial for skeleton-based anomaly detection, which is useful for its lower computational demand in real-world settings. CHAD is also the first anomaly dataset to contain multiple views of the same scene. With four camera views and over 1.15 million frames, CHAD is the largest fully annotated anomaly detection dataset including person annotations, collected from continuous video streams from stationary cameras for smart video surveillance applications. To demonstrate the efficacy of CHAD for training and evaluation, we benchmark two state-of-the-art skeleton-based anomaly detection algorithms on CHAD and provide comprehensive analysis, including both quantitative results and qualitative examination. The dataset is available at https://github.com/TeCSAR-UNCC/CHAD.

CV · Mar 10, 2023 · Code
A POV-based Highway Vehicle Trajectory Dataset and Prediction Architecture

Vinit Katariya, Ghazal Alinezhad Noghre, Armin Danesh Pazho et al.

Vehicle trajectory datasets that provide multiple points of view (POVs) can be valuable for various traffic safety and management applications. Despite the abundance of trajectory datasets, few offer a comprehensive and diverse range of driving scenes, capturing multiple viewpoints of various highway layouts, merging lanes, and configurations. This limits their ability to capture the nuanced interactions between drivers, vehicles, and the roadway infrastructure. We introduce the Carolinas Highway Dataset (CHD, available at https://github.com/TeCSAR-UNCC/Carolinas_Dataset), a vehicle trajectory, detection, and tracking dataset. CHD is a collection of 1.6 million frames captured in highway-based videos from eye-level and high-angle POVs at eight locations across the Carolinas with 338,000 vehicle trajectories. The locations, timing of recordings, and camera angles were carefully selected to capture various road geometries, traffic patterns, lighting conditions, and driving behaviors. We also present PishguVe (code available at https://github.com/TeCSAR-UNCC/PishguVe), a novel vehicle trajectory prediction architecture that uses attention-based graph isomorphism and convolutional neural networks. The results demonstrate that PishguVe outperforms existing algorithms to become the new state-of-the-art (SotA) in bird's-eye, eye-level, and high-angle POV trajectory datasets. Specifically, it achieves a 12.50% and 10.20% improvement in ADE and FDE, respectively, over the current SotA on the NGSIM dataset. Compared to best-performing models on CHD, PishguVe achieves lower ADE and FDE on eye-level data by 14.58% and 27.38%, respectively, and improves ADE and FDE on high-angle data by 8.3% and 6.9%, respectively.
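For readers unfamiliar with the metrics quoted in this abstract: ADE (average displacement error) is conventionally the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE (final displacement error) is that distance at the last step only. The Python sketch below illustrates those conventional definitions and how a percentage improvement is derived from two error values. It is not the CHD or PishguVe evaluation code; the function names and the sample trajectories are hypothetical.

```python
import math


def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all time steps."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last time step."""
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")
    return math.dist(pred[-1], truth[-1])


def relative_improvement(baseline_error, new_error):
    """Percentage reduction of an error value relative to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical (x, y) positions, one per future time step.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
observed = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

print(ade(predicted, observed))          # per-step distances are 0, 1, 2
print(fde(predicted, observed))          # distance at the last step
print(relative_improvement(2.0, 1.75))   # error drop from 2.0 to 1.75, in percent
```

Lower values are better for both metrics, so an "improvement" of 12.50% means the new error is 12.50% smaller than the baseline error.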

24.0 · CV · Apr 18 · Code
EdgeVTP: Exploration of Latency-efficient Trajectory Prediction for Edge-based Embedded Vision Applications

Seungjin Kim, Reza Jafarpourmarzouni, Christopher Neff et al.

Vehicle trajectory prediction is central to highway perception, but deployment on roadside edge devices necessitates bounded, deterministic end-to-end latency. We present EdgeVTP, an embedded-first trajectory predictor that combines interaction-aware graph modeling with a lightweight transformer backbone and a one-shot curve decoder. By predicting future motion as compact curve parameters (anchored at the last observed position) rather than horizon-scaled autoregressive waypoints, EdgeVTP reduces decoding overhead while producing smooth trajectories. To keep runtime predictable in crowded scenes, we explicitly bound interaction complexity via a locality graph with a hard neighbor cap. Across three highway benchmarks and two Jetson-class platforms, EdgeVTP achieves the lowest measured end-to-end latency under a protocol that includes graph construction and post-processing, while attaining state-of-the-art (SotA) prediction accuracy on two of the three datasets and competitive error on other benchmarks. Our code is available at https://github.com/SeungjinStevenKim/EdgeVTP.
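One possible reading of "a locality graph with a hard neighbor cap" is that each vehicle is connected only to vehicles within some radius, and at most a fixed number of the nearest ones are kept. The Python sketch below illustrates that reading only. It is not taken from the EdgeVTP code base, and the function name and the `radius` and `max_neighbors` parameters are assumptions made for illustration.

```python
import math


def build_capped_locality_graph(positions, radius, max_neighbors):
    """Map each node index to at most `max_neighbors` nearest nodes within `radius`."""
    graph = {}
    for i, p in enumerate(positions):
        candidates = []
        for j, q in enumerate(positions):
            if i == j:
                continue
            d = math.dist(p, q)
            if d <= radius:
                candidates.append((d, j))
        # Sort by distance (ties broken by index) and enforce the hard cap.
        candidates.sort()
        graph[i] = [j for _, j in candidates[:max_neighbors]]
    return graph


# Hypothetical vehicle positions in metres along a single lane.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (50.0, 0.0)]
graph = build_capped_locality_graph(positions, radius=10.0, max_neighbors=2)
print(graph)
```

With a cap of k neighbors, the number of edges is at most k times the number of vehicles, so the cost of any per-edge computation stays bounded even when a scene is crowded. The pairwise distance scan in this sketch is still quadratic; a spatial index would be needed to bound graph construction itself.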

CV · Aug 26, 2024 · Code
Towards Adaptive Human-centric Video Anomaly Detection: A Comprehensive Framework and A New Benchmark

Armin Danesh Pazho, Shanle Yao, Ghazal Alinezhad Noghre et al.

Human-centric Video Anomaly Detection (VAD) aims to identify human behaviors that deviate from normal. At its core, human-centric VAD faces substantial challenges, such as the complexity of diverse human behaviors, the rarity of anomalies, and ethical constraints. These challenges limit access to high-quality datasets and highlight the need for a dataset and framework supporting continual learning. Moving towards adaptive human-centric VAD, we introduce the HuVAD (Human-centric privacy-enhanced Video Anomaly Detection) dataset and a novel Unsupervised Continual Anomaly Learning (UCAL) framework. UCAL enables incremental learning, allowing models to adapt over time, bridging traditional training and real-world deployment. HuVAD prioritizes privacy by providing de-identified annotations and includes seven indoor/outdoor scenes, offering over 5x more pose-annotated frames than previous datasets. Our standard and continual benchmarks utilize a comprehensive set of metrics and demonstrate that UCAL-enhanced models achieve superior performance in 82.14% of cases, setting a new state-of-the-art (SOTA). The dataset can be accessed at https://github.com/TeCSAR-UNCC/HuVAD.

CV · Apr 29, 2022 · Code
A Novel Fully Annotated Thermal Infrared Face Dataset: Recorded in Various Environment Conditions and Distances From The Camera

Roshanak Ashrafi, Mona Azarbayjani, Hamed Tabkhi

Facial thermography is one of the most popular research areas in infrared thermal imaging, with diverse applications in medicine, surveillance, and environmental monitoring. However, in contrast to facial imagery in the visual spectrum, the lack of public datasets of facial thermal images is an obstacle to research progress in this area. Thermal face imagery is still a relatively new research area to be evaluated and studied in different domains. The current thermal face datasets are limited with regard to the subjects' distance from the camera, the ambient temperature variation, and the localization of facial landmarks. We address these gaps by presenting a new facial thermography dataset. This article makes two main contributions to the body of knowledge. First, it presents a comprehensive review and comparison of current public datasets in facial thermography. Second, it introduces and studies a novel public dataset on facial thermography, which we call Charlotte-ThermalFace. Charlotte-ThermalFace contains more than 10,000 infrared thermal images in varying thermal conditions, several distances from the camera, and different head positions. The data is fully annotated with the facial landmarks, ambient temperature, relative humidity, the air speed of the room, distance to the camera, and subject thermal sensation at the time of capturing each image. Our dataset is the first publicly available thermal dataset annotated with the thermal sensation of each subject in different thermal conditions and one of the few datasets in raw 16-bit format. Finally, we present a preliminary analysis of the dataset to show the applicability and importance of the thermal conditions in facial thermography. The full dataset, including annotations, is freely available for research purposes at https://github.com/TeCSAR-UNCC/UNCC-ThermalFace

CV · Nov 14, 2023 · Code
VegaEdge: Edge AI Confluence Anomaly Detection for Real-Time Highway IoT-Applications

Vinit Katariya, Fatema-E- Jannat, Armin Danesh Pazho et al.

Vehicle anomaly detection plays a vital role in highway safety applications such as accident prevention, rapid response, traffic flow optimization, and work zone safety. With the surge of the Internet of Things (IoT) in recent years, there is a pressing demand for Artificial Intelligence (AI) based anomaly detection methods designed to meet the requirements of IoT devices. Catering to this vision, we introduce a lightweight approach to vehicle anomaly detection that builds on trajectory prediction. Our proposed design identifies vehicles deviating from expected paths, indicating highway risks, across different camera-viewing angles in real-world highway datasets. In addition, we present VegaEdge, a sophisticated AI confluence designed for real-time security and surveillance applications in modern highway settings through edge-centric IoT-embedded platforms equipped with our anomaly detection approach. Extensive testing across multiple platforms and traffic scenarios showcases the versatility and effectiveness of VegaEdge. This work also presents the Carolinas Anomaly Dataset (CAD) to bridge the existing gap in datasets tailored for highway anomalies. In real-world scenarios, our anomaly detection approach achieves an AUC-ROC of 0.94, and our proposed VegaEdge design, on an embedded IoT platform, processes 738 trajectories per second in a typical highway setting. The dataset is available at https://github.com/TeCSAR-UNCC/Carolinas_Dataset#chd-anomaly-test-set.

9.9 · AI · May 6
Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections

Vinit Katariya, Seungjin Kim, Curtis Craig et al.

Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temporary pedestrian refuges and curb extensions, on vehicle speed and safety. Using deep learning and perspective-based speed estimation, we evaluated driver behavior in Minneapolis before and after interventions, with repeated post-installation monitoring in Week 1 and Week 2. Findings reveal that at unsignalized intersections, mean and 85th-percentile speeds fell by up to 18.75% and 16.56%, respectively, while pass-through traffic decreased by as much as 12.2%. Signalized intersections showed comparable reductions except at one location, with mean and 85th-percentile speeds dropping by up to 20.0% and 17.19%. These results demonstrate the traffic-calming effectiveness of soft infrastructure and underscore the utility of AI-powered methods for rapid, low-cost, and evidence-based transport policy evaluation.
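To make the reported quantities concrete, the Python sketch below shows one way to compute percentage reductions in mean and 85th-percentile speed from before-and-after speed samples. The abstract does not state which percentile convention the study used, so the nearest-rank method here is an assumption, and the speed values and function names are hypothetical.

```python
import statistics


def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile for an integer `percent` in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Ceiling of percent * n / 100 in integer arithmetic avoids float rounding.
    rank = -(-percent * len(ordered) // 100)
    return ordered[rank - 1]


def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Hypothetical per-vehicle speeds (mph) at one intersection.
speeds_before = [32, 34, 36, 38, 40, 40, 42, 44, 46, 48]
speeds_after = [26, 28, 30, 30, 32, 32, 34, 36, 36, 36]

mean_drop = percent_reduction(
    statistics.mean(speeds_before), statistics.mean(speeds_after)
)
p85_drop = percent_reduction(
    nearest_rank_percentile(speeds_before, 85),
    nearest_rank_percentile(speeds_after, 85),
)
print(f"mean speed reduction: {mean_drop:.2f}%")
print(f"85th-percentile speed reduction: {p85_drop:.2f}%")
```

The 85th-percentile speed is the value at or below which 85% of observed vehicles travel, so it tracks the faster drivers rather than the average one.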

CV · Jan 9, 2023
Ancilia: Scalable Intelligent Video Surveillance for the Artificial Intelligence of Things

Armin Danesh Pazho, Christopher Neff, Ghazal Alinezhad Noghre et al.

With the advancement of vision-based artificial intelligence, the proliferation of the Internet of Things connected cameras, and the increasing societal need for rapid and equitable security, the demand for accurate real-time intelligent surveillance has never been higher. This article presents Ancilia, an end-to-end scalable, intelligent video surveillance system for the Artificial Intelligence of Things. Ancilia brings state-of-the-art artificial intelligence to real-world surveillance applications while respecting ethical concerns and performing high-level cognitive tasks in real-time. Ancilia aims to revolutionize the surveillance landscape, to bring more effective, intelligent, and equitable security to the field, resulting in safer and more secure communities without requiring people to compromise their right to privacy.

LG · Jun 8, 2022
A Survey of Graph-based Deep Learning for Anomaly Detection in Distributed Systems

Armin Danesh Pazho, Ghazal Alinezhad Noghre, Arnab A Purkayastha et al.

Anomaly detection is a crucial task in complex distributed systems. A thorough understanding of the requirements and challenges of anomaly detection is pivotal to the security of such systems, especially for real-world deployment. While there are many works and application domains that deal with this problem, few have attempted to provide an in-depth look at such systems. In this survey, we explore the potential of graph-based algorithms to identify anomalies in distributed systems. These systems can be heterogeneous or homogeneous, which can result in distinct requirements. One of our objectives is to provide an in-depth look at graph-based approaches to conceptually analyze their capability to handle real-world challenges such as heterogeneity and dynamic structure. This study gives an overview of the State-of-the-Art (SotA) research articles in the field and compares and contrasts their characteristics. To facilitate a more comprehensive understanding, we present three systems with varying abstractions as use cases. We examine the specific challenges involved in anomaly detection within such systems. Subsequently, we explain the efficacy of graphs in such systems and their advantages. We then delve into the SotA methods and highlight their strengths and weaknesses, pointing out areas for possible improvement and future work.

CY · Feb 8, 2023
Understanding Policy and Technical Aspects of AI-Enabled Smart Video Surveillance to Address Public Safety

Babak Rahimi Ardabili, Armin Danesh Pazho, Ghazal Alinezhad Noghre et al.

Recent advancements in artificial intelligence (AI) have seen the emergence of smart video surveillance (SVS) in many practical applications, particularly for building safer and more secure communities in our urban environments. Cognitive tasks, such as identifying objects, recognizing actions, and detecting anomalous behaviors, can produce data capable of providing valuable insights to the community through statistical and analytical tools. However, the design of artificially intelligent surveillance systems requires special consideration of ethical challenges and concerns. The use and storage of personally identifiable information (PII) commonly pose an increased risk to personal privacy. To address these issues, this paper identifies the privacy concerns and requirements that must be addressed when designing AI-enabled smart video surveillance. Further, we propose the first end-to-end AI-enabled privacy-preserving smart video surveillance system that holistically combines computer vision analytics, statistical data analytics, cloud-native services, and end-user applications. Finally, we propose quantitative and qualitative metrics to evaluate intelligent video surveillance systems. The system processes 17.8 frames per second (FPS) in extreme video scenes. However, considering privacy in designing such a system leads to preferring the pose-based algorithm over the pixel-based one. This choice reduces accuracy in both the action and anomaly detection tasks: the results drop from 97.48 to 73.72 in anomaly detection and from 96 to 83.07 in action detection. On average, the latency of the end-to-end system is 36.1 seconds.

CV · Sep 17, 2024
Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification

Fatema-E- Jannat, Sina Gholami, Jennifer I. Lim et al.

In the medical domain, acquiring large datasets poses significant challenges due to privacy concerns. Nonetheless, the development of a robust deep-learning model for retinal disease diagnosis necessitates a substantial dataset for training. The capacity to generalize effectively on smaller datasets remains a persistent challenge. The scarcity of data presents a significant barrier to the practical implementation of scalable medical AI solutions. To address this issue, we combine a wide range of data sources and develop a self-supervised framework based on large language models (LLMs), SwinV2, to gain a deeper understanding of multi-modal dataset representations, enhancing the model's ability to extrapolate to new data for the detection of eye diseases using optical coherence tomography (OCT) images. We adopt a two-phase training methodology: self-supervised pre-training followed by fine-tuning on a downstream supervised classifier. An ablation study conducted across three datasets, employing various encoder backbones and covering scenarios without data fusion, with low data availability, and without self-supervised pre-training, highlights the robustness of our method. Our findings demonstrate consistent performance across these diverse conditions, showcasing superior generalization capabilities compared to the baseline model, ResNet-50.

CV · Oct 14, 2022
Pishgu: Universal Path Prediction Network Architecture for Real-time Cyber-physical Edge Systems

Ghazal Alinezhad Noghre, Vinit Katariya, Armin Danesh Pazho et al.

Path prediction is an essential task for many real-world Cyber-Physical Systems (CPS) applications, from autonomous driving and traffic monitoring/management to pedestrian/worker safety. These real-world CPS applications need a robust, lightweight path prediction that can provide a universal network architecture for multiple subjects (e.g., pedestrians and vehicles) from different perspectives. However, most existing algorithms are tailor-made for a unique subject with a specific camera perspective and scenario. This article presents Pishgu, a universal lightweight network architecture, as a robust and holistic solution for path prediction. Pishgu's architecture can adapt to multiple path prediction domains with different subjects (vehicles, pedestrians), perspectives (bird's-eye, high-angle), and scenes (sidewalk, highway). Our proposed architecture captures the inter-dependencies within the subjects in each frame by taking advantage of Graph Isomorphism Networks and the attention module. We separately train and evaluate the efficacy of our architecture on three different CPS domains across multiple perspectives (vehicle bird's-eye view, pedestrian bird's-eye view, and human high-angle view). Pishgu outperforms state-of-the-art solutions in the vehicle bird's-eye view domain by 42% and 61% and pedestrian high-angle view domain by 23% and 22% in terms of ADE and FDE, respectively. Additionally, we analyze the domain-specific details for various datasets to understand their effect on path prediction and model interpretation. Finally, we report the latency and throughput for all three domains on multiple embedded platforms showcasing the robustness and adaptability of Pishgu for real-world integration into CPS applications.

CY · Dec 25, 2022
Understanding Ethics, Privacy, and Regulations in Smart Video Surveillance for Public Safety

Babak Rahimi Ardabili, Armin Danesh Pazho, Ghazal Alinezhad Noghre et al.

Recently, Smart Video Surveillance (SVS) systems have been receiving more attention among scholars and developers as a substitute for the current passive surveillance systems. These systems are used to make policing and monitoring more efficient and to improve public safety. However, because these systems monitor the public's daily activities, they raise a range of ethical challenges. There are different approaches for addressing privacy issues in implementing SVS. In this paper, we focus on the role of design in addressing the ethical and privacy challenges of SVS. Reviewing four policy protection regulations that generate an overview of best practices for privacy protection, we argue that ethical and privacy concerns could be addressed through four lenses: algorithm, system, model, and data. As a case study, we describe our proposed system and illustrate how it can create a baseline for designing a privacy-preserving system to deliver safety to society. We used several Artificial Intelligence algorithms, such as object detection, single- and multi-camera re-identification, action recognition, and anomaly detection, to provide a basic functional system. We also use cloud-native services to implement a smartphone application that delivers the outputs to the end users.

CV · Mar 9, 2023
Understanding the Challenges and Opportunities of Pose-based Anomaly Detection

Ghazal Alinezhad Noghre, Armin Danesh Pazho, Vinit Katariya et al.

Pose-based anomaly detection is a video-analysis technique for detecting anomalous events or behaviors by examining human pose extracted from the video frames. Utilizing pose data alleviates privacy and ethical issues. Also, computation-wise, the complexity of pose-based models is lower than that of pixel-based approaches. However, it introduces more challenges, such as noisy skeleton data, the loss of important pixel information, and insufficiently rich features. These problems are exacerbated by a lack of anomaly detection datasets that are sufficiently representative of real-world scenarios. In this work, we analyze and quantify the characteristics of two well-known video anomaly datasets to better understand the difficulties of pose-based anomaly detection. We take a step forward, exploring the discriminating power of pose and trajectory for video anomaly detection and their effectiveness based on context. We believe these experiments are beneficial for a better comprehension of pose-based anomaly detection and the datasets currently available. This will aid researchers in tackling the task of anomaly detection with a clearer perspective, accelerating the development of robust models with better performance.

CV · Nov 11, 2023
VT-Former: An Exploratory Study on Vehicle Trajectory Prediction for Highway Surveillance through Graph Isomorphism and Transformer

Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya et al.

Enhancing roadway safety has become an essential computer vision focus area for Intelligent Transportation Systems (ITS). As a part of ITS, Vehicle Trajectory Prediction (VTP) aims to forecast a vehicle's future positions based on its past and current movements. VTP is a pivotal element for road safety, aiding in applications such as traffic management, accident prevention, work-zone safety, and energy optimization. While most works in this field focus on autonomous driving, with the growing number of surveillance cameras, another sub-field emerges for surveillance VTP with its own set of challenges. In this paper, we introduce VT-Former, a novel transformer-based VTP approach for highway safety and surveillance. In addition to utilizing transformers to capture long-range temporal patterns, a new Graph Attentive Tokenization (GAT) module has been proposed to capture intricate social interactions among vehicles. This study seeks to explore both the advantages and the limitations inherent in combining transformer architecture with graphs for VTP. Our investigation, conducted across three benchmark datasets from diverse surveillance viewpoints, showcases the State-of-the-Art (SotA) or comparable performance of VT-Former in predicting vehicle trajectories. This study underscores the potential of VT-Former and its architecture, opening new avenues for future research and exploration.

LG · Mar 27, 2023
Real-Time Bus Arrival Prediction: A Deep Learning Approach for Enhanced Urban Mobility

Narges Rashvand, Sanaz Sadat Hosseini, Mona Azarbayjani et al.

In urban settings, bus transit stands as a significant mode of public transportation, yet faces hurdles in delivering accurate and reliable arrival times. This discrepancy often culminates in delays and a decline in ridership, particularly in areas with a heavy reliance on bus transit. A prevalent challenge is the mismatch between actual bus arrival times and their scheduled counterparts, leading to disruptions in fixed schedules. Our study, utilizing New York City bus data, reveals an average delay of approximately eight minutes between scheduled and actual bus arrival times. This research introduces an innovative, AI-based, data-driven methodology for predicting bus arrival times at various transit points (stations), offering a collective prediction for all bus lines within large metropolitan areas. Through the deployment of a fully connected neural network, our method elevates the accuracy and efficiency of public bus transit systems. Our comprehensive evaluation encompasses over 200 bus lines and 2 million data points, showcasing an error margin of under 40 seconds for arrival time estimates. Additionally, the inference time for each data point in the validation set is recorded at below 0.006 ms, demonstrating the potential of our Neural-Net-based approach in substantially enhancing the punctuality and reliability of bus transit systems.

CV · Apr 14, 2022
Machine Learning-Based Automated Thermal Comfort Prediction: Integration of Low-Cost Thermal and Visual Cameras for Higher Accuracy

Roshanak Ashrafi, Mona Azarbayjani, Hamed Tabkhi

Recent research is trying to leverage occupants' demand in the building's control loop to consider individuals' well-being and the buildings' energy savings. To that end, a real-time feedback system is needed to provide data about occupants' comfort conditions that can be used to control the building's heating, ventilation, and air conditioning (HVAC) system. The emergence of thermal imaging techniques provides an excellent opportunity for contactless data gathering with no interruption in occupant conditions and activities. There is increasing attention to infrared thermal camera usage in public buildings because of their non-invasive quality in reading the human skin temperature. However, the state-of-the-art methods need additional modifications to become more reliable. To capitalize on this potential and address some existing limitations, new solutions are required to bring a more holistic view toward non-intrusive thermal scanning by leveraging the benefits of machine learning and image processing. This research implements an automated approach to collect and register simultaneous thermal and visual images and read the facial temperature in different regions. This paper also presents two additional investigations. First, by placing iButton wearable thermal sensors on the forehead area, we investigate the reliability of an inexpensive thermal camera (FLIR Lepton) in reading the skin temperature. Second, by studying the false-color version of thermal images, we look into the possibility of using non-radiometric thermal images for predicting personalized thermal comfort. The results show the strong performance of Random Forest and K-Nearest Neighbor prediction algorithms in predicting personalized thermal comfort. In addition, we have found that non-radiometric images can also indicate thermal comfort when the algorithm is trained with larger amounts of data.

14.4 · CY · Apr 20
Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles, CA

Sanaz Sadat Hosseini, Mona Azarbayjani, Mohammad Pourhomayoun et al.

Climate-driven wildfires are intensifying, particularly in urban regions such as Southern California. Yet, traditional fire risk communication tools often fail to gain public trust due to inaccessible design, non-transparent outputs, and limited contextual relevance. These challenges are especially critical in high-risk communities, where trust depends on how clearly and locally information is presented. Neighborhoods such as Pacific Palisades, Pasadena, and Altadena in Los Angeles exemplify these conditions. This study introduces a community-led approach for integrating AI into wildfire risk assessment using the Participatory AI Literacy and Explainability Integration (PALEI) framework. PALEI emphasizes early literacy building, value alignment, and participatory evaluation before deploying predictive models, prioritizing clarity, accessibility, and mutual learning between developers and residents. Early engagement findings show strong acceptance of visual, context-specific risk communication, positive fairness perceptions, and clear adoption interest, alongside privacy and data security concerns that influence trust. Participants emphasized localized imagery, accessible explanations, neighborhood-specific mitigation guidance, and transparent communication of uncertainty. The outcome is a mobile application co-designed with users and stakeholders, enabling residents to scan visible property features and receive interpretable fire risk scores with tailored recommendations. By embedding local context into design, the tool becomes an everyday resource for risk awareness and preparedness. This study argues that user experience is central to ethical and effective AI deployment and provides a replicable, literacy-first pathway for applying the PALEI framework to climate-related hazards.

CV · Apr 29, 2024 · Code
An Exploratory Study on Human-Centric Video Anomaly Detection through Variational Autoencoders and Trajectory Prediction

Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi

Video Anomaly Detection (VAD) represents a challenging and prominent research task within computer vision. In recent years, Pose-based Video Anomaly Detection (PAD) has drawn considerable attention from the research community due to several inherent advantages over pixel-based approaches despite the occasional suboptimal performance. Specifically, PAD is characterized by reduced computational complexity, intrinsic privacy preservation, and the mitigation of concerns related to discrimination and bias against specific demographic groups. This paper introduces TSGAD, a novel human-centric Two-Stream Graph-Improved Anomaly Detection leveraging Variational Autoencoders (VAEs) and trajectory prediction. TSGAD aims to explore the possibility of utilizing VAEs as a new approach for pose-based human-centric VAD alongside the benefits of trajectory prediction. We demonstrate TSGAD's effectiveness through comprehensive experimentation on benchmark datasets. TSGAD demonstrates comparable results with state-of-the-art methods showcasing the potential of adopting variational autoencoders. This suggests a promising direction for future research endeavors. The code base for this work is available at https://github.com/TeCSAR-UNCC/TSGAD.

CV · Aug 27, 2024
Human-Centric Video Anomaly Detection Through Spatio-Temporal Pose Tokenization and Transformer

Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi

Video Anomaly Detection (VAD) presents a significant challenge in computer vision, particularly due to the unpredictable and infrequent nature of anomalous events, coupled with the diverse and dynamic environments in which they occur. Human-centric VAD, a specialized area within this domain, faces additional complexities, including variations in human behavior, potential biases in data, and substantial privacy concerns related to human subjects. These issues complicate the development of models that are both robust and generalizable. To address these challenges, recent advancements have focused on pose-based VAD, which leverages human pose as a high-level feature to mitigate privacy concerns, reduce appearance biases, and minimize background interference. In this paper, we introduce SPARTA, a novel transformer-based architecture designed specifically for human-centric pose-based VAD. SPARTA introduces an innovative Spatio-Temporal Pose and Relative Pose (ST-PRP) tokenization method that produces an enriched representation of human motion over time. This approach ensures that the transformer's attention mechanism captures both spatial and temporal patterns simultaneously, rather than focusing on only one aspect. The addition of the relative pose further emphasizes subtle deviations from normal human movements. The architecture's core, a novel Unified Encoder Twin Decoders (UETD) transformer, significantly improves the detection of anomalous behaviors in video data. Extensive evaluations across multiple benchmark datasets demonstrate that SPARTA consistently outperforms existing methods, establishing a new state-of-the-art in pose-based VAD.

CVMar 22, 2023
Real-World Community-in-the-Loop Smart Video Surveillance -- A Case Study at a Community College

Shanle Yao, Babak Rahimi Ardabili, Armin Danesh Pazho et al.

Smart video surveillance systems have become important recently for ensuring public safety and security, especially in smart cities. However, applying real-time artificial intelligence technologies combined with low-latency notification and alarming has made deploying these systems quite challenging. This paper presents a case study for designing and deploying smart video surveillance systems based on a real-world testbed at a community college. We primarily focus on a smart camera-based system that can identify suspicious/abnormal activities and alert the stakeholders and residents immediately. The paper highlights and addresses different algorithmic and system design challenges to guarantee real-time high-accuracy video analytics processing in the testbed. It also presents an example of cloud system infrastructure and a mobile application for real-time notification to keep students, faculty/staff, and responsible security personnel in the loop. At the same time, it covers the design decisions made to maintain communities' privacy and ethical requirements, as well as hardware configuration and setups. We evaluate the system's performance using throughput and end-to-end latency. The experiment results show that, on average, our system's end-to-end latency to notify the end users in case of detecting suspicious objects is 5.3, 5.78, and 11.11 seconds when running 1, 4, and 8 cameras, respectively. On the other hand, in case of detecting anomalous behaviors, the system could notify the end users with 7.3, 7.63, and 20.78 seconds average latency. These results demonstrate that the system effectively detects abnormal behaviors and suspicious objects and notifies the end users within a reasonable period. The system can run eight cameras simultaneously at a rate of 32.41 frames per second (FPS).

CVJun 6, 2023
Real-Time Online Unsupervised Domain Adaptation for Real-World Person Re-identification

Christopher Neff, Armin Danesh Pazho, Hamed Tabkhi

Following the popularity of Unsupervised Domain Adaptation (UDA) in person re-identification, the recently proposed setting of Online Unsupervised Domain Adaptation (OUDA) attempts to bridge the gap towards practical applications by introducing a consideration of streaming data. However, this still falls short of truly representing real-world applications. This paper defines the setting of Real-world Real-time Online Unsupervised Domain Adaptation (R$^2$OUDA) for Person Re-identification. The R$^2$OUDA setting sets the stage for true real-world real-time OUDA, bringing to light four major limitations found in real-world applications that are often neglected in current research: system generated person images, subset distribution selection, time-based data stream segmentation, and a segment-based time constraint. To address all aspects of this new R$^2$OUDA setting, this paper further proposes Real-World Real-Time Online Streaming Mutual Mean-Teaching (R$^2$MMT), a novel multi-camera system for real-world person re-identification. Taking a popular person re-identification dataset, R$^2$MMT was used to construct over 100 data subsets and train more than 3000 models, exploring the breadth of the R$^2$OUDA setting to understand the training time and accuracy trade-offs and limitations for real-world applications. R$^2$MMT, a real-world system able to respect the strict constraints of the proposed R$^2$OUDA setting, achieves accuracies within 0.1% of comparable OUDA methods that cannot be applied directly to real-world applications.

13.5CVApr 24
Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography

Tania Haghighi, Sina Gholami, Hamed Tabkhi et al.

Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annotations. Supervised deep learning models struggle to generalize across diverse pathologies, imaging devices, and patient populations due to their restricted vocabulary of annotated abnormalities. We propose an unsupervised anomaly detection framework that learns the normative distribution of healthy retinal anatomy without lesion annotations, directly addressing annotation efficiency challenges in clinical deployment. Our approach leverages a discrete latent model trained on normal B-scans to capture OCT-specific structural patterns. To enhance clinical robustness, we incorporate retinal layer-aware supervision and structured triplet learning to separate healthy from pathological representations, improving model reliability across varied imaging conditions. During inference, anomalies are detected and localized via reconstruction discrepancies, enabling both image and pixel-level identification without requiring disease-specific labels. On the Kermany dataset (AUROC: 0.799), our method substantially outperforms VAE, VQVAE, VQGAN, and f-AnoGAN baselines. Critically, cross-dataset evaluation on Srinivasan achieves AUROC 0.884 with superior generalization, demonstrating robust domain adaptation. On the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores, validating reproducibility across institutions.

CVJan 11, 2025Code
Exploring Pose-Based Anomaly Detection for Retail Security: A Real-World Shoplifting Dataset and Benchmark

Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho et al.

Shoplifting poses a significant challenge for retailers, resulting in billions of dollars in annual losses. Traditional security measures often fall short, highlighting the need for intelligent solutions capable of detecting shoplifting behaviors in real time. This paper frames shoplifting detection as an anomaly detection problem, focusing on the identification of deviations from typical shopping patterns. We introduce PoseLift, a privacy-preserving dataset specifically designed for shoplifting detection, addressing challenges such as data scarcity, privacy concerns, and model biases. PoseLift is built in collaboration with a retail store and contains anonymized human pose data from real-world scenarios. By preserving essential behavioral information while anonymizing identities, PoseLift balances privacy and utility. We benchmark state-of-the-art pose-based anomaly detection models on this dataset, evaluating performance using a comprehensive set of metrics. Our results demonstrate that pose-based approaches achieve high detection accuracy while effectively addressing privacy and bias concerns inherent in traditional methods. As one of the first datasets capturing real-world shoplifting behaviors, PoseLift offers researchers a valuable tool to advance computer vision ethically and will be publicly available to foster innovation and collaboration. The dataset is available at https://github.com/TeCSAR-UNCC/PoseLift.

CVApr 28, 2025Code
Shopformer: Transformer-Based Framework for Detecting Shoplifting via Human Pose

Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho et al.

Shoplifting remains a costly issue for the retail sector, but traditional surveillance systems, which are mostly based on human monitoring, are still largely ineffective, with only about 2% of shoplifters being arrested. Existing AI-based approaches rely on pixel-level video analysis, which raises privacy concerns, is sensitive to environmental variations, and demands significant computational resources. To address these limitations, we introduce Shopformer, a novel transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video. We propose a custom tokenization strategy that converts pose sequences into compact embeddings for efficient transformer processing. To the best of our knowledge, this is the first pose-sequence-based transformer model for shoplifting detection. Evaluated on real-world pose data, our method outperforms state-of-the-art anomaly detection models, offering a privacy-preserving and scalable solution for real-time retail surveillance. The code base for this work is available at https://github.com/TeCSAR-UNCC/Shopformer.

45.4CVApr 10Code
From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

Narges Rashvand, Shanle Yao, Armin Danesh Pazho et al.

Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.
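The event-level evaluation described above can be illustrated with a minimal sketch. It assumes events are half-open frame intervals and uses a simple greedy one-to-one matching; the function names (`temporal_iou`, `event_f1`, `mean_f1`) are hypothetical, and the exact matching protocol used in the paper may differ.

```python
from typing import List, Tuple

# Assumption: an event is a half-open frame interval (start, end).
Event = Tuple[int, int]


def temporal_iou(a: Event, b: Event) -> float:
    """Temporal intersection-over-union (tIoU) of two frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(pred: List[Event], gt: List[Event], thr: float) -> float:
    """Event-level F1 at one tIoU threshold.

    Greedy matching (an illustrative choice): each predicted event claims
    the unmatched ground-truth event with the highest tIoU, and counts as
    a true positive only if that tIoU reaches the threshold.
    """
    matched = set()
    tp = 0
    for p in pred:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt):
            if j in matched:
                continue
            iou = temporal_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    fp = len(pred) - tp  # predictions with no matching event
    fn = len(gt) - tp    # ground-truth events that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f1(pred: List[Event], gt: List[Event], thresholds: List[float]) -> float:
    """Average event-level F1 over several tIoU thresholds."""
    return sum(event_f1(pred, gt, t) for t in thresholds) / len(thresholds)
```

A detector can flag many frames inside an anomaly and still score poorly here: a prediction that overlaps a true event only partially fails the match once the tIoU threshold rises, which is the gap between frame-level and event-level metrics that the abstract reports.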

CVSep 23, 2025Code
Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps

Gabriel Maldonado, Narges Rashvand, Armin Danesh Pazho et al.

Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method's superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion's complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at https://github.com/TeCSAR-UNCC/Pose-Quantization.

CVJan 15, 2022Code
Real-World Graph Convolution Networks (RW-GCNs) for Action Recognition in Smart Video Surveillance

Justin Sanchez, Christopher Neff, Hamed Tabkhi

Action recognition is a key algorithmic part of emerging on-the-edge smart video surveillance and security systems. Skeleton-based action recognition is an attractive approach which, instead of using RGB pixel data, relies on human pose information to classify appropriate actions. However, existing algorithms often assume ideal conditions that are not representative of real-world limitations, such as noisy input, latency requirements, and edge resource constraints. To address the limitations of existing approaches, this paper presents Real-World Graph Convolution Networks (RW-GCNs), an architecture-level solution for meeting the domain constraints of real-world skeleton-based action recognition. Inspired by the presence of feedback connections in the human visual cortex, RW-GCNs leverage attentive feedback augmentation on existing near state-of-the-art (SotA) Spatial-Temporal Graph Convolution Networks (ST-GCNs). The ST-GCNs' design choices are derived from information theory-centric principles to address both the spatial and temporal noise typically encountered in end-to-end real-time and on-the-edge smart video systems. Our results demonstrate RW-GCNs' ability to serve these applications by achieving a new SotA accuracy on the NTU-RGB-D-120 dataset at 94.1%, and achieving 32X less latency than baseline ST-GCN applications while still achieving 90.4% accuracy on the Northwestern UCLA dataset in the presence of spatial keypoint noise. RW-GCNs further show system scalability by running on the 10X more cost-effective NVIDIA Jetson Nano (as opposed to NVIDIA Xavier NX), while still maintaining a respectable range of throughput (15.6 to 5.5 Actions per Second) on the resource-constrained device. The code is available here: https://github.com/TeCSAR-UNCC/RW-GCN.

LGAug 19, 2019Code
Real-time Person Re-identification at the Edge: A Mixed Precision Approach

Mohammadreza Baharani, Shrey Mohan, Hamed Tabkhi

A critical part of multi-person multi-camera tracking is the person re-identification (re-ID) algorithm, which recognizes and retains identities of all detected unknown people throughout the video stream. Many re-ID algorithms today achieve state-of-the-art results, but not much work has been done to explore the deployment of such algorithms for computation- and power-constrained real-time scenarios. In this paper, we study the effect of using a lightweight model, MobileNet-V2, for re-ID and investigate the impact of single (FP32) precision versus half (FP16) precision for training on the server and inference on the edge nodes. We further compare the results with the baseline model, which uses ResNet-50, on state-of-the-art benchmarks including CUHK03, Market-1501, and Duke-MTMC. The MobileNet-V2 mixed-precision training method improves inference throughput on the edge node by $3.25\times$ (reaching 27.77 fps) and training time on the server by $1.75\times$, and decreases power consumption on the edge node by $1.45\times$, while reducing accuracy by only 5.6% on average across the three datasets relative to single-precision ResNet-50. The code and pre-trained networks are publicly available at https://github.com/TeCSAR-UNCC/person-reid.

SPMar 8, 2024
Enhancing Automatic Modulation Recognition for IoT Applications Using Transformers

Narges Rashvand, Kenneth Witham, Gabriel Maldonado et al.

Automatic modulation recognition (AMR) is vital for accurately identifying modulation types within incoming signals, a critical task for optimizing operations within edge devices in IoT ecosystems. This paper presents an innovative approach that leverages Transformer networks, initially designed for natural language processing, to address the challenges of efficient AMR. Our transformer network architecture is designed with the mindset of real-time edge computing on IoT devices. Four tokenization techniques are proposed and explored for creating proper embeddings of RF signals, specifically focusing on overcoming the limitations related to the model size often encountered in IoT scenarios. Extensive experiments reveal that our proposed method outperformed advanced deep learning techniques, achieving the highest recognition accuracy. Notably, our model achieves an accuracy of 65.75 on the RML2016 and 65.80 on the CSPB.ML.2018+ dataset.

CVApr 29, 2024
Evaluating the Effectiveness of Video Anomaly Detection in the Wild: Online Learning and Inference for Real-world Deployment

Shanle Yao, Ghazal Alinezhad Noghre, Armin Danesh Pazho et al.

Video Anomaly Detection (VAD) identifies unusual activities in video streams, a key technology with broad applications ranging from surveillance to healthcare. Tackling VAD in real-life settings poses significant challenges due to the dynamic nature of human actions, environmental variations, and domain shifts. Many research initiatives neglect these complexities, often concentrating on traditional testing methods that fail to account for performance on unseen datasets, creating a gap between theoretical models and their real-world utility. Online learning is a potential strategy to mitigate this issue by allowing models to adapt to new information continuously. This paper assesses how well current VAD algorithms can adjust to real-life conditions through an online learning framework, particularly those based on pose analysis, for their efficiency and privacy advantages. Our proposed framework enables continuous model updates with streaming data from novel environments, thus mirroring actual world challenges and evaluating the models' ability to adapt in real-time while maintaining accuracy. We investigate three state-of-the-art models in this setting, focusing on their adaptability across different domains. Our findings indicate that, even under the most challenging conditions, our online learning approach allows a model to preserve 89.39% of its original effectiveness compared to its offline-trained counterpart in a specific target domain.

LGJan 17, 2025
Real-Time Bus Departure Prediction Using Neural Networks for Smart IoT Public Bus Transit

Narges Rashvand, Sanaz Sadat Hosseini, Mona Azarbayjani et al.

Bus transit plays a vital role in urban public transportation but often struggles to provide accurate and reliable departure times. This leads to delays, passenger dissatisfaction, and decreased ridership, particularly in transit-dependent areas. A major challenge lies in the discrepancy between actual and scheduled bus departure times, which disrupts timetables and impacts overall operational efficiency. To address these challenges, this paper presents a neural network-based approach for real-time bus departure time prediction tailored for smart IoT public transit applications. We leverage AI-driven models to enhance the accuracy of bus schedules by preprocessing data, engineering relevant features, and implementing a fully connected neural network that utilizes historical departure data to predict departure times at subsequent stops. In our case study analyzing bus data from Boston, we observed an average deviation of nearly 4 minutes from scheduled times. However, our model, evaluated across 151 bus routes, demonstrates a significant improvement, predicting departure time deviations with an accuracy of under 80 seconds. This advancement not only improves the reliability of bus transit schedules but also plays a crucial role in enabling smart bus systems and IoT applications within public transit networks. By providing more accurate real-time predictions, our approach can facilitate the integration of IoT devices, such as smart bus stops and passenger information systems, that rely on precise data for optimal performance.

CVDec 4, 2023
From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety

Shanle Yao, Babak Rahimi Ardabili, Armin Danesh Pazho et al.

This article adopts and evaluates an AI-enabled Smart Video Solution (SVS) designed to enhance safety in the real world. The system integrates with existing infrastructure camera networks, leveraging recent advancements in AI for easy adoption. Prioritizing privacy and ethical standards, pose-based data is used for downstream AI tasks such as anomaly detection. A cloud-based infrastructure and a mobile app are deployed, enabling real-time alerts within communities. The SVS employs innovative data representation and visualization techniques, such as the Occupancy Indicator, Statistical Anomaly Detection, Bird's Eye View, and Heatmaps, to understand pedestrian behaviors and enhance public safety. Evaluation of the SVS demonstrates its capacity to convert complex computer vision outputs into actionable insights for stakeholders, community partners, law enforcement, urban planners, and social scientists. This article presents a comprehensive real-world deployment and evaluation of the SVS, implemented in a community college environment across 16 cameras. The system integrates AI-driven visual processing, supported by statistical analysis, database management, cloud communication, and user notifications. Additionally, the article evaluates the end-to-end latency from the moment an AI algorithm detects anomalous behavior in real-time at the camera level to the time stakeholders receive a notification. The results demonstrate the system's robustness, effectively managing 16 CCTV cameras with a consistent throughput of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end latency of 26.76 seconds between anomaly detection and alert issuance.

AIMar 5
From Offline to Periodic Adaptation for Pose-Based Shoplifting Detection in Real-world Retail Security

Shanle Yao, Narges Rashvand, Armin Danesh Pazho et al.

Shoplifting is a growing operational and economic challenge for retailers, with incidents rising and losses increasing despite extensive video surveillance. Continuous human monitoring is infeasible, motivating automated, privacy-preserving, and resource-aware detection solutions. In this paper, we cast shoplifting detection as a pose-based, unsupervised video anomaly detection problem and introduce a periodic adaptation framework designed for on-site Internet of Things (IoT) deployment. Our approach enables edge devices in smart retail environments to adapt from streaming, unlabeled data, supporting scalable and low-latency anomaly detection across distributed camera networks. To support reproducibility, we introduce RetailS, a new large-scale real-world shoplifting dataset collected from a retail store under multi-day, multi-camera conditions, capturing unbiased shoplifting behavior in realistic IoT settings. For deployable operation, thresholds are selected using both F1 and H_PRS scores, the harmonic mean of precision, recall, and specificity, during data filtering and training. In periodic adaptation experiments, our framework consistently outperformed offline baselines on AUC-ROC and AUC-PR in 91.6% of evaluations, with each training update completing in under 30 minutes on edge-grade hardware, demonstrating the feasibility and reliability of our solution for IoT-enabled smart retail deployment.
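The H_PRS score mentioned above is described as the harmonic mean of precision, recall, and specificity. A minimal sketch of that definition, together with a simple threshold sweep, follows. The function names and the convention of returning 0 when any component is 0 are assumptions for illustration, not the paper's implementation.

```python
from typing import List, Sequence


def h_prs(tp: int, fp: int, tn: int, fn: int) -> float:
    """Harmonic mean of precision, recall, and specificity.

    Assumption: an undefined or zero component makes the score 0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    parts = (precision, recall, specificity)
    if min(parts) == 0.0:
        return 0.0
    return len(parts) / sum(1.0 / x for x in parts)


def select_threshold(
    scores: Sequence[float], labels: Sequence[int], candidates: List[float]
) -> float:
    """Pick the candidate threshold that maximizes H_PRS.

    A sample is predicted anomalous when its score is >= the threshold;
    labels use 1 for anomalous and 0 for normal.
    """

    def score_at(t: float) -> float:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        return h_prs(tp, fp, tn, fn)

    return max(candidates, key=score_at)
```

Because specificity enters the mean, a threshold that flags nearly everything is penalized even when recall is perfect, which F1 alone does not capture when normal samples dominate.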

CVFeb 8, 2025
MoFM: A Large-Scale Human Motion Foundation Model

Mohammadreza Baharani, Ghazal Alinezhad Noghre, Armin Danesh Pazho et al.

Foundation Models (FM) have increasingly drawn the attention of researchers due to their scalability and generalization across diverse tasks. Inspired by the success of FMs and the principles that have driven advancements in Large Language Models (LLMs), we introduce MoFM as a novel Motion Foundation Model. MoFM is designed for the semantic understanding of complex human motions in both time and space. To facilitate large-scale training, MotionBook, a comprehensive human motion dictionary of discretized motions is designed and employed. MotionBook utilizes Thermal Cubes to capture spatio-temporal motion heatmaps, applying principles from discrete variational models to encode human movements into discrete units for a more efficient and scalable representation. MoFM, trained on a large corpus of motion data, provides a foundational backbone adaptable to diverse downstream tasks, supporting paradigms such as one-shot, unsupervised, and supervised tasks. This versatility makes MoFM well-suited for a wide range of motion-based applications.

CVJan 22, 2024
OCT-SelfNet: A Self-Supervised Framework with Multi-Modal Datasets for Generalized and Robust Retinal Disease Detection

Fatema-E Jannat, Sina Gholami, Minhaj Nur Alam et al.

Despite the revolutionary impact of AI and the development of locally trained algorithms, achieving widespread generalized learning from multi-modal data in medical AI remains a significant challenge. This gap hinders the practical deployment of scalable medical AI solutions. Addressing this challenge, our research contributes a self-supervised robust machine learning framework, OCT-SelfNet, for detecting eye diseases using optical coherence tomography (OCT) images. In this work, datasets from various institutions are combined, enabling a more comprehensive range of representation. Our method addresses the issue using a two-phase training approach that combines self-supervised pretraining and supervised fine-tuning with a masked autoencoder based on the SwinV2 backbone, providing a solution for real-world clinical deployment. Extensive experiments on three datasets with different encoder backbones, low data settings, unseen data settings, and the effect of augmentation show that our method outperforms the baseline model, ResNet-50, by consistently attaining AUC-ROC performance surpassing 77% across all tests, whereas the baseline model exceeds 54%. Moreover, in terms of the AUC-PR metric, our proposed method exceeded 42%, showcasing a substantial increase of at least 10% in performance compared to the baseline, which exceeded only 33%. This contributes to our understanding of our approach's potential and emphasizes its usefulness in clinical settings.

CVAug 12, 2025
ALFred: An Active Learning Framework for Real-world Semi-supervised Anomaly Detection with Adaptive Thresholds

Shanle Yao, Ghazal Alinezhad Noghre, Armin Danesh Pazho et al.

Video Anomaly Detection (VAD) can play a key role in spotting unusual activities in video footage. VAD is difficult to use in real-world settings due to the dynamic nature of human actions, environmental variations, and domain shifts. Traditional evaluation metrics often prove inadequate for such scenarios, as they rely on static assumptions and fall short of identifying a threshold that distinguishes normal from anomalous behavior in dynamic settings. To address this, we introduce an active learning framework tailored for VAD, designed for adapting to the ever-changing real-world conditions. Our approach leverages active learning to continuously select the most informative data points for labeling, thereby enhancing model adaptability. A critical innovation is the incorporation of a human-in-the-loop mechanism, which enables the identification of actual normal and anomalous instances from pseudo-labeling results generated by AI. This collected data allows the framework to define an adaptive threshold tailored to different environments, ensuring that the system remains effective as the definition of 'normal' shifts across various settings. Implemented within a lab-based framework that simulates real-world conditions, our approach allows rigorous testing and refinement of VAD algorithms with a new metric. Experimental results show that our method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world simulated scenarios, demonstrating its practical effectiveness and significantly enhancing the applicability of VAD in dynamic environments.

CYMar 5, 2024
Understanding the Transit Gap: A Comparative Study of On-Demand Bus Services and Urban Climate Resilience in South End, Charlotte, NC and Avondale, Chattanooga, TN

Sanaz Sadat Hosseini, Babak Rahimi Ardabili, Mona Azarbayjani et al.

Urban design significantly impacts sustainability, particularly in the context of public transit efficiency and carbon emissions reduction. This study explores two neighborhoods with distinct urban designs: South End, Charlotte, NC, featuring a dynamic mixed-use urban design pattern, and Avondale, Chattanooga, TN, with a residential suburban grid layout. Using the TRANSIT-GYM tool, we assess the impact of increased bus utilization in these different urban settings on traffic and CO2 emissions. Our results highlight the critical role of urban design and planning in transit system efficiency. In South End, the mixed-use design led to more substantial emission reductions, indicating that urban layout can significantly influence public transit outcomes. Tailored strategies that consider the unique urban design elements are essential for climate resilience. Notably, doubling bus utilization decreased daily emissions by 10.18% in South End and 8.13% in Avondale, with a corresponding reduction in overall traffic. A target of 50% bus utilization saw emissions drop by 21.45% in South End and 14.50% in Avondale. At an idealistic goal of 70% bus utilization, South End and Avondale witnessed emission reductions of 37.22% and 27.80%, respectively. These insights are crucial for urban designers and policymakers in developing sustainable urban landscapes.

CVMar 5
Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Shanle Yao, Armin Danesh Pazho, Narges Rashvand et al.

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.

CVAug 19, 2025
A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment

Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi

Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.

LGJan 26, 2024
Expert with Clustering: Hierarchical Online Preference Learning Framework

Tianyue Zhou, Jung-Hoon Cho, Babak Rahimi Ardabili et al.

Emerging mobility systems are increasingly capable of recommending options to mobility users, to guide them towards personalized yet sustainable system outcomes. Even more so than in the typical recommendation system, it is crucial to minimize regret, because 1) the mobility options directly affect the lives of the users, and 2) the system sustainability relies on sufficient user participation. In this study, we consider accelerating user preference learning by exploiting a low-dimensional latent space that captures the mobility preferences of users. We introduce a hierarchical contextual bandit framework named Expert with Clustering (EWC), which integrates clustering techniques and prediction with expert advice. EWC efficiently utilizes hierarchical user information and incorporates a novel Loss-guided Distance metric. This metric is instrumental in generating more representative cluster centroids. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ options, our algorithm achieves a regret bound of $O(N\sqrt{T\log K} + NT)$. This bound consists of two parts: the first term is the regret from the Hedge algorithm, and the second term depends on the average loss from clustering. To the best of the authors' knowledge, this is the first work to analyze the regret of an integrated expert algorithm with k-Means clustering. This regret bound underscores the theoretical and experimental efficacy of EWC, particularly in scenarios that demand rapid learning and adaptation. Experimental results show that EWC reduces regret by 27.57% compared to the LinUCB baseline. Our work offers a data-efficient approach to capturing both individual and collective behaviors, making it highly applicable to contexts with hierarchical structures. We expect the algorithm to be applicable to other settings with layered nuances of user preferences and information.
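The first term of the regret bound comes from the Hedge algorithm. As a minimal sketch of that component alone (not the authors' EWC implementation, and without the clustering step or the Loss-guided Distance metric), the following runs the standard exponential-weights update on synthetic losses in $[0, 1]$. The function name `hedge`, the loss data, and the learning rate $\eta = \sqrt{8 \ln K / T}$ are assumptions chosen for illustration; with that learning rate, the textbook guarantee is a regret of at most $\sqrt{T \ln K / 2}$.

```python
import math
import random


def hedge(loss_rounds, eta):
    """Run Hedge (exponential weights) over K experts.

    loss_rounds: list of T lists, each holding K losses in [0, 1].
    Returns (expected cumulative loss, regret against the best fixed expert).
    """
    k = len(loss_rounds[0])
    weights = [1.0] * k
    expected_loss = 0.0
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of sampling an expert from the current distribution.
        expected_loss += sum(p * l for p, l in zip(probs, losses))
        # Multiplicative update: down-weight experts that incurred loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    best_expert_loss = min(sum(r[i] for r in loss_rounds) for i in range(k))
    return expected_loss, expected_loss - best_expert_loss


# Synthetic example: K = 5 experts, T = 1000 rounds; expert 0 is better on average.
random.seed(0)
T, K = 1000, 5
loss_rounds = [
    [random.random() * (0.5 if i == 0 else 1.0) for i in range(K)]
    for _ in range(T)
]
eta = math.sqrt(8 * math.log(K) / T)
total_loss, regret = hedge(loss_rounds, eta)
bound = math.sqrt(T * math.log(K) / 2)
print(f"regret={regret:.2f} bound={bound:.2f}")
```

The per-user regret grows on the order of $\sqrt{T \log K}$, which matches the shape of the first term in the abstract's bound once it is summed over $N$ users.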

CVFeb 1, 2022
ADG-Pose: Automated Dataset Generation for Real-World Human Pose Estimation

Ghazal Alinezhad Noghre, Armin Danesh Pazho, Justin Sanchez et al.

Recent advancements in computer vision have seen a rise in the prominence of applications using neural networks to understand human poses. However, while accuracy has been steadily increasing on State-of-the-Art datasets, these datasets often do not address the challenges seen in real-world applications. These challenges include people distant from the camera, people in crowds, and heavily occluded people. As a result, many real-world applications have trained on data that does not reflect the data present in deployment, leading to significant underperformance. This article presents ADG-Pose, a method for automatically generating datasets for real-world human pose estimation. These datasets can be customized to determine person distances, crowdedness, and occlusion distributions. Models trained with our method are able to perform in the presence of these challenges where those trained on other datasets fail. Using ADG-Pose, end-to-end accuracy for real-world skeleton-based action recognition sees a 20% increase on scenes with moderate distance and occlusion levels, and a 4X increase on distant scenes where other models failed to perform better than random.

LGAug 1, 2021
DeepTrack: Lightweight Deep Learning for Vehicle Path Prediction in Highways

Vinit Katariya, Mohammadreza Baharani, Nichole Morris et al.

Vehicle trajectory prediction is essential for enabling safety-critical intelligent transportation systems (ITS) applications used in management and operations. While there have been some promising advances in the field, there is a need for modern deep learning algorithms that allow real-time trajectory prediction on embedded IoT devices. This article presents DeepTrack, a novel deep learning algorithm customized for real-time vehicle trajectory prediction and monitoring applications in arterial management, freeway management, traffic incident management, and work zone management for high-speed incoming traffic. In contrast to previous methods, the vehicle dynamics are encoded using Temporal Convolutional Networks (TCNs) to provide more robust time prediction with less computation. DeepTrack also uses depthwise convolution, which reduces the complexity of models compared to existing approaches in terms of model size and operations. Overall, our experimental results demonstrate that DeepTrack achieves comparable accuracy to state-of-the-art trajectory prediction models but with smaller model sizes and lower computational complexity, making it more suitable for real-world deployment.
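To illustrate why depthwise convolution reduces model size, the sketch below compares the weight count of a standard 1D convolution with that of a depthwise convolution followed by a pointwise (1x1) convolution. The depthwise-plus-pointwise pairing, the channel counts, and the kernel size are assumptions for illustration and are not taken from the DeepTrack architecture; biases are ignored.

```python
def standard_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise conv followed by a pointwise (1x1) conv."""
    depthwise = c_in * kernel_size  # one filter per input channel
    pointwise = c_in * c_out        # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 64 output channels, kernel size 3.
c_in, c_out, k = 64, 64, 3
std = standard_conv1d_params(c_in, c_out, k)
sep = depthwise_separable_conv1d_params(c_in, c_out, k)
print(f"standard={std} separable={sep} reduction={std / sep:.2f}x")
```

For this hypothetical layer the separable form needs 4,288 weights instead of 12,288, and the saving grows with kernel size.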

CVFeb 5, 2021
Single Run Action Detector over Video Stream -- A Privacy Preserving Approach

Anbumalar Saravanan, Justin Sanchez, Hassan Ghasemzadeh et al.

This paper takes initial strides toward designing and evaluating a vision-based system for privacy-ensured activity monitoring. The proposed technology utilizes Artificial Intelligence (AI)-empowered proactive systems that offer continuous monitoring, behavioral analysis, and modeling of human activities. To this end, this paper presents the Single Run Action Detector (S-RAD), a real-time privacy-preserving action detector that performs end-to-end action localization and classification. It is based on Faster-RCNN combined with temporal shift modeling and segment-based sampling to capture human actions. Results on the UCF-Sports and UR Fall datasets show accuracy comparable to State-of-the-Art approaches with significantly lower model size and computation demand, and the ability to execute in real time on edge embedded devices (e.g., Nvidia Jetson Xavier).

LGNov 10, 2020
ATCN: Resource-Efficient Processing of Time Series on Edge

Mohammadreza Baharani, Hamed Tabkhi

This paper presents a scalable deep learning model called Agile Temporal Convolutional Network (ATCN) for highly accurate, fast classification and time series prediction in resource-constrained embedded systems. ATCN is a family of compact networks with formalized hyperparameters that enable application-specific adjustments to be made to the model architecture. It is primarily designed for embedded edge devices with very limited performance and memory, such as wearable biomedical devices and real-time reliability monitoring systems. ATCN makes fundamental improvements over the mainstream temporal convolutional neural networks, including residual connections to increase the network depth and accuracy, and the incorporation of separable depth-wise convolution to reduce the computational complexity of the model. As part of the present work, two ATCN families, namely T0 and T1, are also presented and evaluated on different ranges of embedded processors: the Cortex-M7 and Cortex-A57. An evaluation of the ATCN models against the best-in-class InceptionTime and MiniRocket shows that ATCN almost maintains accuracy while improving the execution time on a broad range of embedded and cyber-physical applications with demand for real-time processing on the embedded edge. At the same time, in contrast to existing solutions, ATCN is the first time-series classifier based on deep learning that can be run bare-metal on embedded microcontrollers (Cortex-M7) with limited computational performance and memory capacity while delivering state-of-the-art accuracy.

CVJul 16, 2020
EfficientHRNet: Efficient Scaling for Lightweight High-Resolution Multi-Person Pose Estimation

Christopher Neff, Aneri Sheth, Steven Furgurson et al.

There is an increasing demand for lightweight multi-person pose estimation for many emerging smart IoT applications. However, the existing algorithms tend to have large model sizes and intense computational requirements, making them ill-suited for real-time applications and deployment on resource-constrained hardware. Lightweight and real-time approaches are exceedingly rare and come at the cost of inferior accuracy. In this paper, we present EfficientHRNet, a family of lightweight multi-person human pose estimators that are able to perform in real-time on resource-constrained devices. By unifying recent advances in model scaling with high-resolution feature representations, EfficientHRNet creates highly accurate models while reducing computation enough to achieve real-time performance. The largest model is able to come within 4.4% accuracy of the current state-of-the-art, while having 1/3 the model size and 1/6 the computation, achieving 23 FPS on Nvidia Jetson Xavier. Compared to the top real-time approach, EfficientHRNet increases accuracy by 22% while achieving similar FPS with 1/3 the power. At every level, EfficientHRNet proves to be more computationally efficient than other bottom-up 2D human pose estimation approaches, while achieving highly competitive accuracy.

CVMay 26, 2020
CARPe Posterum: A Convolutional Approach for Real-time Pedestrian Path Prediction

Matías Mendieta, Hamed Tabkhi

Pedestrian path prediction is an essential topic in computer vision and video understanding. Having insight into the movement of pedestrians is crucial for ensuring safe operation in a variety of applications including autonomous vehicles, social robots, and environmental monitoring. Current works in this area utilize complex generative or recurrent methods to capture many possible futures. However, despite the inherent real-time nature of predicting future paths, little work has been done to explore accurate and computationally efficient approaches for this task. To this end, we propose a convolutional approach for real-time pedestrian path prediction, CARPe. It utilizes a variation of Graph Isomorphism Networks in combination with an agile convolutional neural network design to form a fast and accurate path prediction approach. Notable results in both inference speed and prediction accuracy are achieved, improving FPS considerably in comparison to current state-of-the-art methods while delivering competitive accuracy on well-known path prediction datasets.

CVNov 20, 2019
REVAMP$^2$T: Real-time Edge Video Analytics for Multi-camera Privacy-aware Pedestrian Tracking

Christopher Neff, Matías Mendieta, Shrey Mohan et al.

This article presents REVAMP$^2$T, Real-time Edge Video Analytics for Multi-camera Privacy-aware Pedestrian Tracking, as an integrated end-to-end IoT system for privacy-built-in decentralized situational awareness. REVAMP$^2$T presents novel algorithmic and system constructs to push deep learning and video analytics next to IoT devices (i.e., video cameras). On the algorithm side, REVAMP$^2$T proposes a unified integrated computer vision pipeline for detection, re-identification, and tracking across multiple cameras without the need for storing the streaming data. At the same time, it avoids facial recognition, and tracks and re-identifies pedestrians based on their key features at runtime. On the IoT system side, REVAMP$^2$T provides infrastructure to maximize hardware utilization on the edge, orchestrates global communications, and provides system-wide re-identification, without the use of personally identifiable information, for a distributed IoT network. For the results and evaluation, this article also proposes a new metric, Accuracy$\cdot$Efficiency (Æ), for holistic evaluation of IoT systems for real-time video analytics based on accuracy, performance, and power efficiency. REVAMP$^2$T outperforms the current state-of-the-art with as much as a thirteen-fold improvement in Æ.

LGAug 3, 2019
Real-time Deep Learning at the Edge for Scalable Reliability Modeling of Si-MOSFET Power Electronics Converters

Mohammadreza Baharani, Mehrdad Biglarbegian, Babak Parkhideh et al.

With the significant growth of advanced high-frequency power converters, on-line monitoring and active reliability assessment of power electronic devices are crucial. This article presents a transformative approach, named Deep Learning Reliability Awareness of Converters at the Edge (Deep RACE), for real-time reliability modeling and prediction of high-frequency MOSFET power electronic converters. Deep RACE offers a holistic solution that comprises algorithmic advances and full system integration (from the cloud down to the edge node) to create near real-time reliability awareness. On the algorithm side, this paper proposes a deep learning solution based on stacked LSTM for collective reliability training and inference across MOSFET converters based on device resistance changes. Deep RACE also proposes an integrative edge-to-cloud solution to offer scalable, decentralized, device-specific reliability monitoring, awareness, and modeling. The MOSFET converters are IoT devices that have been empowered with real-time deep learning processing capabilities at the edge. The proposed Deep RACE solution has been prototyped and implemented by learning from a MOSFET data set provided by NASA. Our experimental results show an average misprediction of $8.9\%$ over five different devices, which is much higher accuracy than well-known classical approaches (Kalman Filter and Particle Filter). Deep RACE requires only $26$ ms of processing time and $1.87$ W of computing power on an edge IoT device.