CVSep 14, 2023
Vision-based Analysis of Driver Activity and Driving Performance Under the Influence of AlcoholRoss Greer, Akshay Gopalkrishnan, Sumega Mandadi et al.
About 30% of all traffic crash fatalities in the United States involve drunk drivers, making the prevention of drunk driving paramount to vehicle safety in the US and other locations which have a high prevalence of driving while under the influence of alcohol. Driving impairment can be monitored through active use of sensors (when drivers are asked to engage in providing breath samples to a vehicle instrument or when pulled over by a police officer), but a more passive and robust mechanism of sensing may allow for wider adoption and benefit of intelligent systems that reduce drunk driving accidents. This could assist in identifying impaired drivers before they drive, or early in the driving process (before a crash or detection by law enforcement). In this research, we introduce a study which adopts a multi-modal ensemble of visual, thermal, audio, and chemical sensors to (1) examine the impact of acute alcohol administration on driving performance in a driving simulator, and (2) identify data-driven methods for detecting driving under the influence of alcohol. We describe computer vision and machine learning models for analyzing the driver's face in thermal imagery, and introduce a pipeline for training models on data collected from drivers with a range of breath-alcohol content levels, including discussion of relevant machine learning phenomena which can help in future experiment design for related studies.
CVFeb 5, 2024
ActiveAnno3D -- An Active Learning Framework for Multi-Modal 3D Object DetectionAhmed Ghita, Bjørk Antoniussen, Walter Zimmer et al.
The curation of large-scale datasets is still costly and requires much time and resources. Data is often manually labeled, and the challenge of creating high-quality datasets remains. In this work, we fill the research gap using active learning for multi-modal 3D object detection. We propose ActiveAnno3D, an active learning framework to select data samples for labeling that are of maximum informativeness for training. We explore various continuous training methods and integrate the most efficient method regarding computational demand and detection performance. Furthermore, we perform extensive experiments and ablation studies with BEVFusion and PV-RCNN on the nuScenes and TUM Traffic Intersection dataset. We show that we can achieve almost the same performance with PV-RCNN and the entropy-based query strategy when using only half of the training data (77.25 mAP compared to 83.50 mAP) of the TUM Traffic Intersection dataset. BEVFusion achieved an mAP of 64.31 when using half of the training data and 75.0 mAP when using the complete nuScenes dataset. We integrate our active learning framework into the proAnno labeling tool to enable AI-assisted data selection and labeling and minimize the labeling costs. Finally, we provide code, weights, and visualization results on our website: https://active3d-framework.github.io/active3d-framework.
CVJan 30, 2024
The Why, When, and How to Use Active Learning in Large-Data-Driven 3D Object Detection for Safe Autonomous Driving: An Empirical ExplorationRoss Greer, Bjørk Antoniussen, Mathias V. Andersen et al.
Active learning strategies for 3D object detection in autonomous driving datasets may help to address challenges of data imbalance, redundancy, and high-dimensional data. We demonstrate the effectiveness of entropy querying to select informative samples, aiming to reduce annotation costs and improve model performance. We experiment using the BEVFusion model for 3D object detection on the nuScenes dataset, comparing active learning to random sampling and demonstrating that entropy querying outperforms in most cases. The method is particularly effective in reducing the performance gap between majority and minority classes. Class-specific analysis reveals efficient allocation of annotated resources for limited data budgets, emphasizing the importance of selecting diverse and informative data for model training. Our findings suggest that entropy querying is a promising strategy for selecting data that enhances model learning in resource-constrained environments.
CVDec 2, 2021
On Salience-Sensitive Sign Classification in Autonomous Vehicle Path Planning: Experimental Explorations with a Novel DatasetRoss Greer, Jason Isa, Nachiket Deo et al.
Safe path planning in autonomous driving is a complex task due to the interplay of static scene elements and uncertain surrounding agents. While all static scene elements are a source of information, there is asymmetric importance to the information available to the ego vehicle. We present a dataset with a novel feature, sign salience, defined to indicate whether a sign is distinctly informative to the goals of the ego vehicle with regards to traffic regulations. Using convolutional networks on cropped signs, in tandem with experimental augmentation by road type, image coordinates, and planned maneuver, we predict the sign salience property with 76% accuracy, finding the best improvement using information on vehicle maneuver with sign images.
CVJul 27, 2021
Predicting Take-over Time for Autonomous Driving with Real-World Data: Robust Data Augmentation, Models, and EvaluationAkshay Rangesh, Nachiket Deo, Ross Greer et al.
Understanding occupant-vehicle interactions by modeling control transitions is important to ensure safe approaches to passenger vehicle automation. Models which contain contextual, semantically meaningful representations of driver states can be used to determine the appropriate timing and conditions for transfer of control between driver and vehicle. However, such models rely on real-world control take-over data from drivers engaged in distracting activities, which is costly to collect. Here, we introduce a scheme for data augmentation for such a dataset. Using the augmented dataset, we develop and train take-over time (TOT) models that operate sequentially on mid and high-level features produced by computer vision algorithms operating on different driver-facing camera views, showing models trained on the augmented dataset to outperform the initial dataset. The demonstrated model features encode different aspects of the driver state, pertaining to the face, hands, foot and upper body of the driver. We perform ablative experiments on feature combinations as well as model architectures, showing that a TOT model supported by augmented data can be used to produce continuous estimates of take-over times without delay, suitable for complex real-world scenarios.
ROApr 23, 2021
Autonomous Vehicles that Alert Humans to Take-Over Controls: Modeling with Real-World DataAkshay Rangesh, Nachiket Deo, Ross Greer et al.
With increasing automation in passenger vehicles, the study of safe and smooth occupant-vehicle interaction and control transitions is key. In this study, we focus on the development of contextual, semantically meaningful representations of the driver state, which can then be used to determine the appropriate timing and conditions for transfer of control between driver and vehicle. To this end, we conduct a large-scale real-world controlled data study where participants are instructed to take-over control from an autonomous agent under different driving conditions while engaged in a variety of distracting activities. These take-over events are captured using multiple driver-facing cameras, which when labelled result in a dataset of control transitions and their corresponding take-over times (TOTs). We then develop and train TOT models that operate sequentially on mid to high-level features produced by computer vision algorithms operating on different driver-facing camera views. The proposed TOT model produces continuous predictions of take-over times without delay, and shows promising qualitative and quantitative results in complex real-world scenarios.
CVMar 22, 2021
LaneAF: Robust Multi-Lane Detection with Affinity FieldsHala Abualsaud, Sean Liu, David Lu et al.
This study presents an approach to lane detection involving the prediction of binary segmentation masks and per-pixel affinity fields. These affinity fields, along with the binary masks, can then be used to cluster lane pixels horizontally and vertically into corresponding lane instances in a post-processing step. This clustering is achieved through a simple row-by-row decoding process with little overhead; such an approach allows LaneAF to detect a variable number of lanes without assuming a fixed or maximum number of lanes. Moreover, this form of clustering is more interpretable in comparison to previous visual clustering approaches, and can be analyzed to identify and correct sources of error. Qualitative and quantitative results obtained on popular lane detection datasets demonstrate the model's ability to detect and cluster lanes effectively and robustly. Our proposed approach sets a new state-of-the-art on the challenging CULane dataset and the recently introduced Unsupervised LLAMAS dataset.
CVJan 11, 2021
TrackMPNN: A Message Passing Graph Neural Architecture for Multi-Object TrackingAkshay Rangesh, Pranav Maheshwari, Mez Gebre et al.
This study follows many classical approaches to multi-object tracking (MOT) that model the problem using dynamic graphical data structures, and adapts this formulation to make it amenable to modern neural networks. Our main contributions in this work are the creation of a framework based on dynamic undirected graphs that represent the data association problem over multiple timesteps, and a message passing graph neural network (MPNN) that operates on these graphs to produce the desired likelihood for every association therein. We also provide solutions and propositions for the computational problems that need to be addressed to create a memory-efficient, real-time, online algorithm that can reason over multiple timesteps, correct previous mistakes, update beliefs, and handle missed/false detections. To demonstrate the efficacy of our approach, we only use the 2D box location and object category ID to construct the descriptor for each object instance. Despite this, our model performs on par with state-of-the-art approaches that make use of additional sensors, as well as multiple hand-crafted and/or learned features. This illustrates that given the right problem formulation and model design, raw bounding boxes (and their kinematics) from any off-the-shelf detector are sufficient to achieve competitive tracking results on challenging MOT benchmarks.
CVMay 6, 2020
Trajectory Prediction for Autonomous Driving based on Multi-Head Attention with Joint Agent-Map RepresentationKaouther Messaoud, Nachiket Deo, Mohan M. Trivedi et al.
Predicting the trajectories of surrounding agents is an essential ability for autonomous vehicles navigating through complex traffic scenes. The future trajectories of agents can be inferred using two important cues: the locations and past motion of agents, and the static scene structure. Due to the high variability in scene structure and agent configurations, prior work has employed the attention mechanism, applied separately to the scene and agent configuration to learn the most salient parts of both cues. However, the two cues are tightly linked. The agent configuration can inform what part of the scene is most relevant to prediction. The static scene in turn can help determine the relative influence of agents on each other's motion. Moreover, the distribution of future trajectories is multimodal, with modes corresponding to the agent's intent. The agent's intent also informs what part of the scene and agent configuration is relevant to prediction. We thus propose a novel approach applying multi-head attention by considering a joint representation of the static scene and surrounding agents. We use each attention head to generate a distinct future trajectory to address multimodality of future trajectories. Our model achieves state of the art results on the nuScenes prediction benchmark and generates diverse future trajectories compliant with scene structure and agent configuration.
CVFeb 6, 2020
Gaze Preserving CycleGANs for Eyeglass Removal & Persistent Gaze EstimationAkshay Rangesh, Bowen Zhang, Mohan M. Trivedi
A driver's gaze is critical for determining their attention, state, situational awareness, and readiness to take over control from partially automated vehicles. Estimating the gaze direction is the most obvious way to gauge a driver's state under ideal conditions when limited to using non-intrusive imaging sensors. Unfortunately, the vehicular environment introduces a variety of challenges that are usually unaccounted for - harsh illumination, nighttime conditions, and reflective eyeglasses. Relying on head pose alone under such conditions can prove to be unreliable and erroneous. In this study, we offer solutions to address these problems encountered in the real world. To solve issues with lighting, we demonstrate that using an infrared camera with suitable equalization and normalization suffices. To handle eyeglasses and their corresponding artifacts, we adopt image-to-image translation using generative adversarial networks to pre-process images prior to gaze estimation. Our proposed Gaze Preserving CycleGAN (GPCycleGAN) is trained to preserve the driver's gaze while removing potential eyeglasses from face images. GPCycleGAN is based on the well-known CycleGAN approach - with the addition of a gaze classifier and a gaze consistency loss for additional supervision. Our approach exhibits improved performance, interpretability, robustness and superior qualitative results on challenging real-world datasets.
CVJan 3, 2020
Trajectory Forecasts in Unknown Environments Conditioned on Grid-Based PlansNachiket Deo, Mohan M. Trivedi
We address the problem of forecasting pedestrian and vehicle trajectories in unknown environments, conditioned on their past motion and scene structure. Trajectory forecasting is a challenging problem due to the large variation in scene structure and the multimodal distribution of future trajectories. Unlike prior approaches that directly learn one-to-many mappings from observed context to multiple future trajectories, we propose to condition trajectory forecasts on plans sampled from a grid based policy learned using maximum entropy inverse reinforcement learning (MaxEnt IRL). We reformulate MaxEnt IRL to allow the policy to jointly infer plausible agent goals, and paths to those goals on a coarse 2-D grid defined over the scene. We propose an attention based trajectory generator that generates continuous valued future trajectories conditioned on state sequences sampled from the MaxEnt policy. Quantitative and qualitative evaluation on the publicly available Stanford drone and NuScenes datasets shows that our model generates trajectories that are diverse, representing the multimodal predictive distribution, and precise, conforming to the underlying scene structure over long prediction horizons.
HCSep 30, 2019
On Assessing Driver Awareness of Situational Criticalities: Multi-modal Bio-sensing and Vision-based Analysis, Evaluations, and InsightsSiddharth Siddharth, Mohan M. Trivedi
Automobiles for our roadways are increasingly using advanced driver assistance systems. The adoption of such new technologies requires us to develop novel perception systems not only for accurately understanding the situational context of these vehicles, but also to infer the driver's awareness in differentiating between safe and critical situations. This manuscript focuses on the specific problem of inferring driver awareness in the context of attention analysis and hazardous incident activity. Even after the development of wearable and compact multi-modal bio-sensing systems in recent years, their application in driver awareness context has been scarcely explored. The~capability of simultaneously recording different kinds of bio-sensing data in addition to traditionally employed computer vision systems provides exciting opportunities to explore the limitations of these sensor modalities. In this work, we explore the applications of three different bio-sensing modalities namely electroencephalogram (EEG), photoplethysmogram (PPG) and galvanic skin response (GSR) along with a camera-based vision system in driver awareness context. We assess the information from these sensors independently and together using both signal processing- and deep learning-based tools. We show that our methods outperform previously reported studies to classify driver attention and detecting hazardous/non-hazardous situations for short time scales of two seconds. We use EEG and vision data for high resolution temporal classification (two seconds) while additionally also employing PPG and GSR over longer time periods. We evaluate our methods by collecting user data on twelve subjects for two real-world driving datasets among which one is publicly available (KITTI dataset) while the other was collected by us (LISA dataset) with the vehicle ....
CVJul 27, 2019
Forced Spatial Attention for Driver Foot Activity ClassificationAkshay Rangesh, Mohan M. Trivedi
This paper provides a simple solution for reliably solving image classification tasks tied to spatial locations of salient objects in the scene. Unlike conventional image classification approaches that are designed to be invariant to translations of objects in the scene, we focus on tasks where the output classes vary with respect to where an object of interest is situated within an image. To handle this variant of the image classification task, we propose augmenting the standard cross-entropy (classification) loss with a domain dependent Forced Spatial Attention (FSA) loss, which in essence compels the network to attend to specific regions in the image associated with the desired output class. To demonstrate the utility of this loss function, we consider the task of driver foot activity classification - where each activity is strongly correlated with where the driver's foot is in the scene. Training with our proposed loss function results in significantly improved accuracies, better generalization, and robustness against noise, while obviating the need for very large datasets.
ROMay 23, 2019
Scene Induced Multi-Modal Trajectory Forecasting via PlanningNachiket Deo, Mohan M. Trivedi
We address multi-modal trajectory forecasting of agents in unknown scenes by formulating it as a planning problem. We present an approach consisting of three models; a goal prediction model to identify potential goals of the agent, an inverse reinforcement learning model to plan optimal paths to each goal, and a trajectory generator to obtain future trajectories along the planned paths. Analysis of predictions on the Stanford drone dataset, shows generalizability of our approach to novel scenes.
CVMay 14, 2019
Understanding Pedestrian-Vehicle Interactions with Vehicle Mounted Vision: An LSTM Model and Empirical AnalysisDaniela A. Ridel, Nachiket Deo, Denis Wolf et al.
Pedestrians and vehicles often share the road in complex inner city traffic. This leads to interactions between the vehicle and pedestrians, with each affecting the other's motion. In order to create robust methods to reason about pedestrian behavior and to design interfaces of communication between self-driving cars and pedestrians we need to better understand such interactions. In this paper, we present a data-driven approach to implicitly model pedestrians' interactions with vehicles, to better predict pedestrian behavior. We propose a LSTM model that takes as input the past trajectories of the pedestrian and ego-vehicle, and pedestrian head orientation, and predicts the future positions of the pedestrian. Our experiments based on a real-world, inner city dataset captured with vehicle mounted cameras, show that the usage of such cues improve pedestrian prediction when compared to a baseline that purely uses the past trajectory of the pedestrian.
HCMay 1, 2019
Attention Monitoring and Hazard Assessment with Bio-Sensing and Vision: Empirical Analysis Utilizing CNNs on the KITTI DatasetSiddharth, Mohan M. Trivedi
Assessing the driver's attention and detecting various hazardous and non-hazardous events during a drive are critical for driver's safety. Attention monitoring in driving scenarios has mostly been carried out using vision (camera-based) modality by tracking the driver's gaze and facial expressions. It is only recently that bio-sensing modalities such as Electroencephalogram (EEG) are being explored. But, there is another open problem which has not been explored sufficiently yet in this paradigm. This is the detection of specific events, hazardous and non-hazardous, during driving that affects the driver's mental and physiological states. The other challenge in evaluating multi-modal sensory applications is the absence of very large scale EEG data because of the various limitations of using EEG in the real world. In this paper, we use both of the above sensor modalities and compare them against the two tasks of assessing the driver's attention and detecting hazardous vs. non-hazardous driving events. We collect user data on twelve subjects and show how in the absence of very large-scale datasets, we can still use pre-trained deep learning convolution networks to extract meaningful features from both of the above modalities. We used the publicly available KITTI dataset for evaluating our platform and to compare it with previous studies. Finally, we show that the results presented in this paper surpass the previous benchmark set up in the above driver awareness-related applications.
CVNov 16, 2018
Ground Plane Polling for 6DoF Pose Estimation of Objects on the RoadAkshay Rangesh, Mohan M. Trivedi
This paper introduces an approach to produce accurate 3D detection boxes for objects on the ground using single monocular images. We do so by merging 2D visual cues, 3D object dimensions, and ground plane constraints to produce boxes that are robust against small errors and incorrect predictions. First, we train a single-shot convolutional neural network (CNN) that produces multiple visual and geometric cues of interest: 2D bounding boxes, 2D keypoints of interest, coarse object orientations and object dimensions. Subsets of these cues are then used to poll probable ground planes from a pre-computed database of ground planes, to identify the "best fit" plane with highest consensus. Once identified, the "best fit" plane provides enough constraints to successfully construct the desired 3D detection box, without directly predicting the 6DoF pose of the object. The entire ground plane polling (GPP) procedure is constructed as a non-parametrized layer of the CNN that outputs the desired "best fit" plane and the corresponding 3D keypoints, which together define the final 3D bounding box. Doing so allows us to poll thousands of different ground plane configurations without adding considerable overhead, while also creating a single CNN that directly produces the desired output without the need for post processing. We evaluate our method on the 2D detection and orientation estimation benchmark from the challenging KITTI dataset, and provide additional comparisons for 3D metrics of importance. This single-stage, single-pass CNN results in superior localization and orientation estimation compared to more complex and computationally expensive monocular approaches.
CVNov 14, 2018
Looking at the Driver/Rider in Autonomous Vehicles to Predict Take-Over ReadinessNachiket Deo, Mohan M. Trivedi
Continuous estimation the driver's take-over readiness is critical for safe and timely transfer of control during the failure modes of autonomous vehicles. In this paper, we propose a data-driven approach for estimating the driver's take-over readiness based purely on observable cues from in-vehicle vision sensors. We present an extensive naturalistic drive dataset of drivers in a conditionally autonomous vehicle running on Californian freeways. We collect subjective ratings for the driver's take-over readiness from multiple human observers viewing the sensor feed. Analysis of the ratings in terms of intra-class correlation coefficients (ICCs) shows a high degree of consistency in the ratings across raters. We define a metric for the driver's take-over readiness termed the 'Observable Readiness Index (ORI)' based on the ratings. Finally, we propose an LSTM model for continuous estimation of the driver's ORI based on a holistic representation of the driver's state, capturing gaze, hand, pose and foot activity. Our model estimates the ORI with a mean absolute error of 0.449 on a 5 point scale.
CVMay 15, 2018
Convolutional Social Pooling for Vehicle Trajectory PredictionNachiket Deo, Mohan M. Trivedi
Forecasting the motion of surrounding vehicles is a critical ability for an autonomous vehicle deployed in complex traffic. Motion of all vehicles in a scene is governed by the traffic context, i.e., the motion and relative spatial configuration of neighboring vehicles. In this paper we propose an LSTM encoder-decoder model that uses convolutional social pooling as an improvement to social pooling layers for robustly learning interdependencies in vehicle motion. Additionally, our model outputs a multi-modal predictive distribution over future trajectories based on maneuver classes. We evaluate our model using the publicly available NGSIM US-101 and I-80 datasets. Our results show improvement over the state of the art in terms of RMS values of prediction error and negative log-likelihoods of true future trajectories under the model's predictive distribution. We also present a qualitative analysis of the model's predicted distributions for various traffic scenarios.
CVMay 15, 2018
Multi-Modal Trajectory Prediction of Surrounding Vehicles with Maneuver based LSTMsNachiket Deo, Mohan M. Trivedi
To safely and efficiently navigate through complex traffic scenarios, autonomous vehicles need to have the ability to predict the future motion of surrounding vehicles. Multiple interacting agents, the multi-modal nature of driver behavior, and the inherent uncertainty involved in the task make motion prediction of surrounding vehicles a challenging problem. In this paper, we present an LSTM model for interaction aware motion prediction of surrounding vehicles on freeways. Our model assigns confidence values to maneuvers being performed by vehicles and outputs a multi-modal distribution over future motion based on them. We compare our approach with the prior art for vehicle motion prediction on the publicly available NGSIM US-101 and I-80 datasets. Our results show an improvement in terms of RMS values of prediction error. We also present an ablative analysis of the components of our proposed model and analyze the predictions made by the model in complex traffic scenarios.
CVApr 20, 2018
HandyNet: A One-stop Solution to Detect, Segment, Localize & Analyze Driver HandsAkshay Rangesh, Mohan M. Trivedi
Tasks related to human hands have long been part of the computer vision community. Hands being the primary actuators for humans, convey a lot about activities and intents, in addition to being an alternative form of communication/interaction with other humans and machines. In this study, we focus on training a single feedforward convolutional neural network (CNN) capable of executing many hand related tasks that may be of use in autonomous and semi-autonomous vehicles of the future. The resulting network, which we refer to as HandyNet, is capable of detecting, segmenting and localizing (in 3D) driver hands inside a vehicle cabin. The network is additionally trained to identify handheld objects that the driver may be interacting with. To meet the data requirements to train such a network, we propose a method for cheap annotation based on chroma-keying, thereby bypassing weeks of human effort required to label such data. This process can generate thousands of labeled training samples in an efficient manner, and may be replicated in new environments with relative ease.
IVApr 3, 2018
Looking at Hands in Autonomous Vehicles: A ConvNet Approach using Part Affinity FieldsKevan Yuen, Mohan M. Trivedi
In the context of autonomous driving, where humans may need to take over in the event where the computer may issue a takeover request, a key step towards driving safety is the monitoring of the hands to ensure the driver is ready for such a request. This work, focuses on the first step of this process, which is to locate the hands. Such a system must work in real-time and under varying harsh lighting conditions. This paper introduces a fast ConvNet approach, based on the work of original work of OpenPose for full body joint estimation. The network is modified with fewer parameters and retrained using our own day-time naturalistic autonomous driving dataset to estimate joint and affinity heatmaps for driver & passenger's wrist and elbows, for a total of 8 joint classes and part affinity fields between each wrist-elbow pair. The approach runs real-time on real-world data at 40 fps on multiple drivers and passengers. The system is extensively evaluated both quantitatively and qualitatively, showing at least 95% detection performance on joint localization and arm-angle estimation.
CVFeb 23, 2018
No Blind Spots: Full-Surround Multi-Object Tracking for Autonomous Vehicles using Cameras & LiDARsAkshay Rangesh, Mohan M. Trivedi
Online multi-object tracking (MOT) is extremely important for high-level spatial reasoning and path planning for autonomous and highly-automated vehicles. In this paper, we present a modular framework for tracking multiple objects (vehicles), capable of accepting object proposals from different sensor modalities (vision and range) and a variable number of sensors, to produce continuous object tracks. This work is a generalization of the MDP framework for MOT, with some key extensions - First, we track objects across multiple cameras and across different sensor modalities. This is done by fusing object proposals across sensors accurately and efficiently. Second, the objects of interest (targets) are tracked directly in the real world. This is a departure from traditional techniques where objects are simply tracked in the image plane. Doing so allows the tracks to be readily used by an autonomous agent for navigation and related tasks. To verify the effectiveness of our approach, we test it on real world highway data collected from a heavily sensorized testbed capable of capturing full-surround information. We demonstrate that our framework is well-suited to track objects through entire maneuvers around the ego-vehicle, some of which take more than a few minutes to complete. We also leverage the modularity of our approach by comparing the effects of including/excluding different sensors, changing the total number of sensors, and the quality of object proposals on the final tracking result.
CVFeb 22, 2018
Driver Hand Localization and Grasp Analysis: A Vision-based Real-time ApproachSiddharth, Akshay Rangesh, Eshed Ohn-Bar et al.
Extracting hand regions and their grasp information from images robustly in real-time is critical for occupants' safety and in-vehicular infotainment applications. It must however, be noted that naturalistic driving scenes suffer from rapidly changing illumination and occlusion. This is aggravated by the fact that hands are highly deformable objects, and change in appearance frequently. This work addresses the task of accurately localizing driver hands and classifying the grasp state of each hand. We use a fast ConvNet to first detect likely hand regions. Next, a pixel-based skin classifier that takes into account the global illumination changes is used to refine the hand detections and remove false positives. This step generates a pixel-level mask for each hand. Finally, we study each such masked regions and detect if the driver is grasping the wheel, or in some cases a mobile phone. Through evaluation we demonstrate that our method can outperform state-of-the-art pixel based hand detectors, while running faster (at 35 fps) than other deep ConvNet based frameworks even for grasp analysis. Hand mask cues are shown to be crucial when analyzing a set of driver hand gestures (wheel/mobile phone grasp and no-grasp) in naturalistic driving settings. The proposed detection and localization pipeline hence can act as a general framework for real-time hand detection and gesture classification.
CVFeb 8, 2018
Driver Gaze Zone Estimation using Convolutional Neural Networks: A General Framework and Ablative AnalysisSourabh Vora, Akshay Rangesh, Mohan M. Trivedi
Driver gaze has been shown to be an excellent surrogate for driver attention in intelligent vehicles. With the recent surge of highly autonomous vehicles, driver gaze can be useful for determining the handoff time to a human driver. While there has been significant improvement in personalized driver gaze zone estimation systems, a generalized system which is invariant to different subjects, perspectives and scales is still lacking. We take a step towards this generalized system using Convolutional Neural Networks (CNNs). We finetune 4 popular CNN architectures for this task, and provide extensive comparisons of their outputs. We additionally experiment with different input image patches, and also examine how image size affects performance. For training and testing the networks, we collect a large naturalistic driving dataset comprising of 11 long drives, driven by 10 subjects in two different cars. Our best performing model achieves an accuracy of 95.18% during cross-subject testing, outperforming current state of the art techniques for this task. Finally, we evaluate our best performing model on the publicly available Columbia Gaze Dataset comprising of images from 56 subjects with varying head pose and gaze directions. Without any training, our model successfully encodes the different gaze directions on this diverse dataset, demonstrating good generalization capabilities.
CVFeb 5, 2018
An Occluded Stacked Hourglass Approach to Facial Landmark Localization and Occlusion EstimationKevan Yuen, Mohan M. Trivedi
A key step to driver safety is to observe the driver's activities with the face being a key step in this process to extracting information such as head pose, blink rate, yawns, talking to passenger which can then help derive higher level information such as distraction, drowsiness, intent, and where they are looking. In the context of driving safety, it is important for the system perform robust estimation under harsh lighting and occlusion but also be able to detect when the occlusion occurs so that information predicted from occluded parts of the face can be taken into account properly. This paper introduces the Occluded Stacked Hourglass, based on the work of original Stacked Hourglass network for body pose joint estimation, which is retrained to process a detected face window and output 68 occlusion heat maps, each corresponding to a facial landmark. Landmark location, occlusion levels and a refined face detection score, to reject false positives, are extracted from these heat maps. Using the facial landmark locations, features such as head pose and eye/mouth openness can be extracted to derive driver attention and activity. The system is evaluated for face detection, head pose, and occlusion estimation on various datasets in the wild, both quantitatively and qualitatively, and shows state-of-the-art results.
CVJan 31, 2018
Dynamics of Driver's Gaze: Explorations in Behavior Modeling & Maneuver PredictionSujitha Martin, Sourabh Vora, Kevan Yuen et al.
The study and modeling of driver's gaze dynamics is important because, if and how the driver is monitoring the driving environment is vital for driver assistance in manual mode, for take-over requests in highly automated mode and for semantic perception of the surround in fully autonomous mode. We developed a machine vision based framework to classify driver's gaze into context rich zones of interest and model driver's gaze behavior by representing gaze dynamics over a time period using gaze accumulation, glance duration and glance frequencies. As a use case, we explore the driver's gaze dynamic patterns during maneuvers executed in freeway driving, namely, left lane change maneuver, right lane change maneuver and lane keeping. It is shown that condensing gaze dynamics into durations and frequencies leads to recurring patterns based on driver activities. Furthermore, modeling these patterns show predictive powers in maneuver detection up to a few hundred milliseconds a priori.
CVJan 24, 2018
When Vehicles See Pedestrians with Phones:A Multi-Cue Framework for Recognizing Phone-based Activities of PedestriansAkshay Rangesh, Mohan M. Trivedi
The intelligent vehicle community has devoted considerable efforts to model driver behavior, and in particular to detect and overcome driver distraction in an effort to reduce accidents caused by driver negligence. However, as the domain increasingly shifts towards autonomous and semi-autonomous solutions, the driver is no longer integral to the decision making process, indicating a need to refocus efforts elsewhere. To this end, we propose to study pedestrian distraction instead. In particular, we focus on detecting pedestrians who are engaged in secondary activities involving their cellphones and similar handheld multimedia devices from a purely vision-based standpoint. To achieve this objective, we propose a pipeline incorporating articulated human pose estimation, followed by a soft object label transfer from an ensemble of exemplar SVMs trained on the nearest neighbors in pose feature space. We additionally incorporate head gaze features and prior pose information to carry out cellphone related pedestrian activity recognition. Finally, we offer a method to reliably track the articulated pose of a pedestrian through a sequence of images using a particle filter with a Gaussian Process Dynamical Model (GPDM), which can then be used to estimate sequentially varying activity scores at a very low computational cost. The entire framework is fast (especially for sequential data) and accurate, and easily extensible to include other secondary activities and sources of distraction.
CVJan 19, 2018
How would surround vehicles move? A Unified Framework for Maneuver Classification and Motion PredictionNachiket Deo, Akshay Rangesh, Mohan M. Trivedi
Reliable prediction of surround vehicle motion is a critical requirement for path planning for autonomous vehicles. In this paper we propose a unified framework for surround vehicle maneuver classification and motion prediction that exploits multiple cues, namely, the estimated motion of vehicles, an understanding of typical motion patterns of freeway traffic and inter-vehicle interaction. We report our results in terms of maneuver classification accuracy and mean and median absolute error of predicted trajectories against the ground truth for real traffic data collected using vehicle mounted sensors on freeways. An ablative analysis is performed to analyze the relative importance of each cue for trajectory prediction. Additionally, an analysis of execution time for the components of the framework is presented. Finally, we present multiple case studies analyzing the outputs of our model for complex traffic scenarios
CVSep 21, 2017
A Multimodal, Full-Surround Vehicular Testbed for Naturalistic Studies and Benchmarking: Design, Calibration and DeploymentAkshay Rangesh, Kevan Yuen, Ravi Kumar Satzoda et al.
Recent progress in autonomous and semi-autonomous driving has been made possible in part through an assortment of sensors that provide the intelligent agent with an enhanced perception of its surroundings. It has been clear for quite some while now that for intelligent vehicles to function effectively in all situations and conditions, a fusion of different sensor technologies is essential. Consequently, the availability of synchronized multi-sensory data streams are necessary to promote the development of fusion based algorithms for low, mid and high level semantic tasks. In this paper, we provide a comprehensive description of LISA-A: our heavily sensorized, full-surround testbed capable of providing high quality data from a slew of synchronized and calibrated sensors such as cameras, LIDARs, radars, and the IMU/GPS. The vehicle has recorded over 100 hours of real world data for a very diverse set of weather, traffic and daylight conditions. All captured data is accurately calibrated and synchronized using timestamps, and stored safely in high performance servers mounted inside the vehicle itself. Details on the testbed instrumentation, sensor layout, sensor outputs, calibration and synchronization are described in this paper.
CVJan 6, 2017
To Boost or Not to Boost? On the Limits of Boosted Trees for Object DetectionEshed Ohn-Bar, Mohan M. Trivedi
We aim to study the modeling limitations of the commonly employed boosted decision trees classifier. Inspired by the success of large, data-hungry visual recognition models (e.g. deep convolutional neural networks), this paper focuses on the relationship between modeling capacity of the weak learners, dataset size, and dataset properties. A set of novel experiments on the Caltech Pedestrian Detection benchmark results in the best known performance among non-CNN techniques while operating at fast run-time speed. Furthermore, the performance is on par with deep architectures (9.71% log-average miss rate), while using only HOG+LUV channels as features. The conclusions from this study are shown to generalize over different object detection domains as demonstrated on the FDDB face detection benchmark (93.37% accuracy). Despite the impressive performance, this study reveals the limited modeling capacity of the common boosted trees model, motivating a need for architectural changes in order to compete with multi-level and very deep architectures.
CVMar 12, 2015
Learning to Detect Vehicles by Clustering Appearance PatternsEshed Ohn-Bar, Mohan M. Trivedi
This paper studies efficient means for dealing with intra-category diversity in object detection. Strategies for occlusion and orientation handling are explored by learning an ensemble of detection models from visual and geometrical clusters of object instances. An AdaBoost detection scheme is employed with pixel lookup features for fast detection. The analysis provides insight into the design of a robust vehicle detection system, showing promise in terms of detection performance and orientation estimation accuracy.