Goldie Nejat

RO
h-index37
9papers
123citations
Novelty48%
AI Score39

9 Papers

ROMar 1, 2022
Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Angus Fung, Beno Benhabib, Goldie Nejat

Robotic detection of people in crowded and/or cluttered human-centered environments including hospitals, long-term care, stores and airports is challenging as people can become occluded by other people or objects, and deform due to variations in clothing or pose. There can also be loss of discriminative visual features due to poor lighting. In this paper, we present a novel multimodal person detection architecture to address the mobile robot problem of person detection under intraclass variations. We present a two-stage training approach using 1) a unique pretraining method we define as Temporal Invariant Multimodal Contrastive Learning (TimCLR), and 2) a Multimodal Faster R-CNN (MFRCNN) detector. TimCLR learns person representations that are invariant under intraclass variations through unsupervised learning. Our approach is unique in that it generates image pairs from natural variations within multimodal image sequences, in addition to synthetic data augmentation, and contrasts crossmodal features to transfer invariances between different modalities. These pretrained features are used by the MFRCNN detector for finetuning and person detection from RGB-D images. Extensive experiments validate the performance of our DL architecture in both human-centered crowded and cluttered environments. Results show that our method outperforms existing unimodal and multimodal person detection approaches in terms of detection accuracy in detecting people with body occlusions and pose deformations in different lighting conditions.

54.2ROApr 7
ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

Souren Pashangpour, Haitong Wang, Matthew Lisondra et al.

Mobile manipulators are increasingly deployed in human-centered environments to perform tasks. While completing such tasks, they should also be able to communicate their intent to the people around them using expressive robot behaviors. Prior work on expressive robot behaviors has used preprogrammed or learning-from-demonstration- based expressive motions and large language model generated high-level interactions. The majority of these existing approaches have not considered human-robot interactions (HRI) where users may interrupt, modify, or redirect a robot's actions during task execution. In this paper, we develop the novel ExpressMM framework that integrates a high-level language-guided planner based on a vision-language model for perception and conversational reasoning with a low-level vision-language-action policy to generate expressive robot behaviors during collaborative HRI tasks. Furthermore, ExpressMM supports interruptible interactions to accommodate updated or redirecting instructions by users. We demonstrate ExpressMM on a mobile manipulator assisting a human in a collaborative assembly scenario and conduct audience-based evaluation of live HRI demonstrations. Questionnaire results show that the ExpressMM-enabled expressive behaviors helped observers clearly interpret the robot's actions and intentions while supporting socially appropriate and understandable interactions. Participants also reported that the robot was useful for collaborative tasks and behaved in a predictable and safe manner during the demonstrations, fostering positive perceptions of the robot's usefulness, safety, and predictability during the collaborative tasks.

RONov 5, 2024
The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare

Souren Pashangpour, Goldie Nejat

The potential use of large language models (LLMs) in healthcare robotics can help address the significant demand put on healthcare systems around the world with respect to an aging demographic and a shortage of healthcare professionals. Even though LLMs have already been integrated into medicine to assist both clinicians and patients, the integration of LLMs within healthcare robots has not yet been explored for clinical settings. In this perspective paper, we investigate the groundbreaking developments in robotics and LLMs to uniquely identify the needed system requirements for designing health specific LLM based robots in terms of multi modal communication through human robot interactions (HRIs), semantic reasoning, and task planning. Furthermore, we discuss the ethical issues, open challenges, and potential future research directions for this emerging innovative field.

ROJan 31, 2025
Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach

Aaron Hao Tan, Angus Fung, Haitong Wang et al.

Hand-drawn maps can be used to convey navigation instructions between humans and robots in a natural and efficient manner. However, these maps can often contain inaccuracies such as scale distortions and missing landmarks which present challenges for mobile robot navigation. This paper introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture that leverages pre-trained vision language models (VLMs) for robot navigation across diverse environments, hand-drawing styles, and robot embodiments, even in the presence of map inaccuracies. HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation and navigation planning as well as a Predictive Navigation Plan Parser to infer missing landmarks. Extensive experiments were conducted in photorealistic simulated environments, using both wheeled and legged robots, demonstrating the effectiveness of HAM-Nav in terms of navigation success rates and Success weighted by Path Length. Furthermore, a user study in real-world environments highlighted the practical utility of hand-drawn maps for robot navigation as well as successful navigation outcomes compared against a non-hand-drawn map approach.

CVFeb 13, 2024
LDTrack: Dynamic People Tracking by Service Robots using Diffusion Models

Angus Fung, Beno Benhabib, Goldie Nejat

Tracking of dynamic people in cluttered and crowded human-centered environments is a challenging robotics problem due to the presence of intraclass variations including occlusions, pose deformations, and lighting variations. This paper introduces a novel deep learning architecture, using conditional latent diffusion models, the Latent Diffusion Track (LDTrack), for tracking multiple dynamic people under intraclass variations. By uniquely utilizing conditional latent diffusion models to capture temporal person embeddings, our architecture can adapt to appearance changes of people over time. We incorporated a latent feature encoder network which enables the diffusion process to operate within a high-dimensional latent space to allow for the extraction and spatial-temporal refinement of such rich features as person appearance, motion, location, identity, and contextual information. Extensive experiments demonstrate the effectiveness of LDTrack over other state-of-the-art tracking methods in cluttered and crowded human-centered environments under intraclass variations. Namely, the results show our method outperforms existing deep learning robotic people tracking methods in both tracking accuracy and tracking precision with statistical significance. Additionally, a comprehensive multi-object tracking comparison study was performed against the state-of-the-art methods in urban environments, demonstrating the generalizability of LDTrack. An ablation study was performed to validate the design choices of LDTrack.

RONov 27, 2024
MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models

Angus Fung, Aaron Hao Tan, Haitong Wang et al.

Robotic search of people in human-centered environments, including healthcare settings, is challenging as autonomous robots need to locate people without complete or any prior knowledge of their schedules, plans or locations. Furthermore, robots need to be able to adapt to real-time events that can influence a person's plan in an environment. In this paper, we present MLLM-Search, a novel zero-shot person search architecture that leverages multimodal large language models (MLLM) to address the mobile robot problem of searching for a person under event-driven scenarios with varying user schedules. Our approach introduces a novel visual prompting method to provide robots with spatial understanding of the environment by generating a spatially grounded waypoint map, representing navigable waypoints by a topological graph and regions by semantic labels. This is incorporated into a MLLM with a region planner that selects the next search region based on the semantic relevance to the search scenario, and a waypoint planner which generates a search path by considering the semantically relevant objects and the local spatial context through our unique spatial chain-of-thought prompting approach. Extensive 3D photorealistic experiments were conducted to validate the performance of MLLM-Search in searching for a person with a changing schedule in different environments. An ablation study was also conducted to validate the main design choices of MLLM-Search. Furthermore, a comparison study with state-of-the art search methods demonstrated that MLLM-Search outperforms existing methods with respect to search efficiency. Real-world experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to finding a person in a new unseen environment.

ROOct 5, 2021
Deep Reinforcement Learning for Decentralized Multi-Robot Exploration With Macro Actions

Aaron Hao Tan, Federico Pizarro Bejarano, Yuhan Zhu et al.

Cooperative multi-robot teams need to be able to explore cluttered and unstructured environments while dealing with communication dropouts that prevent them from exchanging local information to maintain team coordination. Therefore, robots need to consider high-level teammate intentions during action selection. In this letter, we present the first Macro Action Decentralized Exploration Network (MADE-Net) using multi-agent deep reinforcement learning (DRL) to address the challenges of communication dropouts during multi-robot exploration in unseen, unstructured, and cluttered environments. Simulated robot team exploration experiments were conducted and compared against classical and DRL methods where MADE-Net outperformed all benchmark methods in terms of computation time, total travel distance, number of local interactions between robots, and exploration rate across various degrees of communication dropouts. A scalability study in 3D environments showed a decrease in exploration time with MADE-Net with increasing team and environment sizes. The experiments presented highlight the effectiveness and robustness of our method.

LGSep 3, 2021
Multimodal Detection of COVID-19 Symptoms using Deep Learning & Probability-based Weighting of Modes

Meysam Effati, Yu-Chen Sun, Hani E. Naguib et al.

The COVID-19 pandemic is one of the most challenging healthcare crises during the 21st century. As the virus continues to spread on a global scale, the majority of efforts have been on the development of vaccines and the mass immunization of the public. While the daily case numbers were following a decreasing trend, the emergent of new virus mutations and variants still pose a significant threat. As economies start recovering and societies start opening up with people going back into office buildings, schools, and malls, we still need to have the ability to detect and minimize the spread of COVID-19. Individuals with COVID-19 may show multiple symptoms such as cough, fever, and shortness of breath. Many of the existing detection techniques focus on symptoms having the same equal importance. However, it has been shown that some symptoms are more prevalent than others. In this paper, we present a multimodal method to predict COVID-19 by incorporating existing deep learning classifiers using convolutional neural networks and our novel probability-based weighting function that considers the prevalence of each symptom. The experiments were performed on an existing dataset with respect to the three considered modes of coughs, fever, and shortness of breath. The results show considerable improvements in the detection of COVID-19 using our weighting function when compared to an equal weighting function.

CVDec 15, 2020
Robots Understanding Contextual Information in Human-Centered Environments using Weakly Supervised Mask Data Distillation

Daniel Dworakowski, Goldie Nejat

Contextual information in human environments, such as signs, symbols, and objects provide important information for robots to use for exploration and navigation. To identify and segment contextual information from complex images obtained in these environments, data-driven methods such as Convolutional Neural Networks (CNNs) are used. However, these methods require large amounts of human labeled data which are slow and time-consuming to obtain. Weakly supervised methods address this limitation by generating pseudo segmentation labels (PSLs). In this paper, we present the novel Weakly Supervised Mask Data Distillation (WeSuperMaDD) architecture for autonomously generating PSLs using CNNs not specifically trained for the task of context segmentation; i.e., CNNs trained for object classification, image captioning, etc. WeSuperMaDD uniquely generates PSLs using learned image features from sparse and limited diversity data; common in robot navigation tasks in human-centred environments (malls, grocery stores). Our proposed architecture uses a new mask refinement system which automatically searches for the PSL with the fewest foreground pixels that satisfies cost constraints. This removes the need for handcrafted heuristic rules. Extensive experiments successfully validated the performance of WeSuperMaDD in generating PSLs for datasets with text of various scales, fonts, and perspectives in multiple indoor/outdoor environments. A comparison with Naive, GrabCut, and Pyramid methods found a significant improvement in label and segmentation quality. Moreover, a context segmentation CNN trained using the WeSuperMaDD architecture achieved measurable improvements in accuracy compared to one trained with Naive PSLs. Our method also had comparable performance to existing state-of-the-art text detection and segmentation methods on real datasets without requiring segmentation labels for training.