Lex Fridman

CV
21papers
1,257citations
Novelty40%
AI Score26

21 Papers

HCJun 26, 2023
CLERA: A Unified Model for Joint Cognitive Load and Eye Region Analysis in the Wild

Li Ding, Jack Terwilliger, Aishni Parab et al. · mit

Non-intrusive, real-time analysis of the dynamics of the eye region allows us to monitor humans' visual attention allocation and estimate their mental state during the performance of real-world tasks, which can potentially benefit a wide range of human-computer interaction (HCI) applications. While commercial eye-tracking devices have been frequently employed, the difficulty of customizing these devices places unnecessary constraints on the exploration of more efficient, end-to-end models of eye dynamics. In this work, we propose CLERA, a unified model for Cognitive Load and Eye Region Analysis, which achieves precise keypoint detection and spatiotemporal tracking in a joint-learning framework. Our method demonstrates significant efficiency and outperforms prior work on tasks including cognitive load estimation, eye landmark detection, and blink estimation. We also introduce a large-scale dataset of 30k human faces with joint pupil, eye-openness, and landmark annotation, which aims to support future HCI research on human factors and eye-related analysis.

CVJul 25, 2019
Object as Distribution

Li Ding, Lex Fridman

Object detection is a critical part of visual scene understanding. The representation of the object in the detection task has important implications on the efficiency and feasibility of annotation, robustness to occlusion, pose, lighting, and other visual sources of semantic uncertainty, and effectiveness in real-world applications (e.g., autonomous driving). Popular object representations include 2D and 3D bounding boxes, polygons, splines, pixels, and voxels. Each have their strengths and weakness. In this work, we propose a new representation of objects based on the bivariate normal distribution. This distribution-based representation has the benefit of robust detection of highly-overlapping objects and the potential for improved downstream tracking and instance segmentation tasks due to the statistical representation of object edges. We provide qualitative evaluation of this representation for the object detection task and quantitative evaluation of its use in a baseline algorithm for the instance segmentation task.

HCApr 8, 2019
Dynamics of Pedestrian Crossing Decisions Based on Vehicle Trajectories in Large-Scale Simulated and Real-World Data

Jack Terwilliger, Michael Glazer, Henri Schmidt et al.

Humans, as both pedestrians and drivers, generally skillfully navigate traffic intersections. Despite the uncertainty, danger, and the non-verbal nature of communication commonly found in these interactions, there are surprisingly few collisions considering the total number of interactions. As the role of automation technology in vehicles grows, it becomes increasingly critical to understand the relationship between pedestrian and driver behavior: how pedestrians perceive the actions of a vehicle/driver and how pedestrians make crossing decisions. The relationship between time-to-arrival (TTA) and pedestrian gap acceptance (i.e., whether a pedestrian chooses to cross under a given window of time to cross) has been extensively investigated. However, the dynamic nature of vehicle trajectories in the context of non-verbal communication has not been systematically explored. Our work provides evidence that trajectory dynamics, such as changes in TTA, can be powerful signals in the non-verbal communication between drivers and pedestrians. Moreover, we investigate these effects in both simulated and real-world datasets, both larger than have previously been considered in literature to the best of our knowledge.

HCApr 8, 2019
Eye Contact Between Pedestrians and Drivers

Dina AlAdawy, Michael Glazer, Jack Terwilliger et al.

When asked, a majority of people believe that, as pedestrians, they make eye contact with the driver of an approaching vehicle when making their crossing decisions. This work presents evidence that this widely held belief is false. We do so by showing that, in majority of cases where conflict is possible, pedestrians begin crossing long before they are able to see the driver through the windshield. In other words, we are able to circumvent the very difficult question of whether pedestrians choose to make eye contact with drivers, by showing that whether they think they do or not, they can't. Specifically, we show that over 90\% of people in representative lighting conditions cannot determine the gaze of the driver at 15m and see the driver at all at 30m. This means that, for example, that given the common city speed limit of 25mph, more than 99% of pedestrians would have begun crossing before being able to see either the driver or the driver's gaze. In other words, from the perspective of the pedestrian, in most situations involving an approaching vehicle, the crossing decision is made by the pedestrian solely based on the kinematics of the vehicle without needing to determine that eye contact was made by explicitly detecting the eyes of the driver.

HCApr 2, 2019
Hacking Nonverbal Communication Between Pedestrians and Vehicles in Virtual Reality

Henri Schmidt, Jack Terwilliger, Dina AlAdawy et al.

We use an immersive virtual reality environment to explore the intricate social cues that underlie non-verbal communication involved in a pedestrian's crossing decision. We "hack" non-verbal communication between pedestrian and vehicle by engineering a set of 15 vehicle trajectories, some of which follow social conventions and some that break them. By subverting social expectations of vehicle behavior we show that pedestrians may use vehicle kinematics to infer social intentions and not merely as the state of a moving object. We investigate human behavior in this virtual world by conducting a study of 22 subjects, with each subject experiencing and responding to each of the trajectories by moving their body, legs, arms, and head in both the physical and the virtual world. Both quantitative and qualitative responses are collected and analyzed, showing that, in fact, social cues can be engineered through vehicle trajectory manipulation. In addition, we demonstrate that immersive virtual worlds which allow the pedestrian to move around freely, provide a powerful way to understand both the mechanisms of human perception and the social signaling involved in pedestrian-vehicle interaction.

CVMar 21, 2019
Value of Temporal Dynamics Information in Driving Scene Segmentation

Li Ding, Jack Terwilliger, Rini Sherony et al.

Semantic scene segmentation has primarily been addressed by forming representations of single images both with supervised and unsupervised methods. The problem of semantic segmentation in dynamic scenes has begun to recently receive attention with video object segmentation approaches. What is not known is how much extra information the temporal dynamics of the visual scene carries that is complimentary to the information available in the individual frames of the video. There is evidence that the human visual system can effectively perceive the scene from temporal dynamics information of the scene's changing visual characteristics without relying on the visual characteristics of individual snapshots themselves. Our work takes steps to explore whether machine perception can exhibit similar properties by combining appearance-based representations and temporal dynamics representations in a joint-learning problem that reveals the contribution of each toward successful dynamic scene segmentation. Additionally, we provide the MIT Driving Scene Segmentation dataset, which is a large-scale full driving scene segmentation dataset, densely annotated for every pixel and every one of 5,000 video frames. This dataset is intended to help further the exploration of the value of temporal dynamics information for semantic segmentation in video.

AIOct 3, 2018
Human-Centered Autonomous Vehicle Systems: Principles of Effective Shared Autonomy

Lex Fridman

Building effective, enjoyable, and safe autonomous vehicles is a lot harder than has historically been considered. The reason is that, simply put, an autonomous vehicle must interact with human beings. This interaction is not a robotics problem nor a machine learning problem nor a psychology problem nor an economics problem nor a policy problem. It is all of these problems put into one. It challenges our assumptions about the limitations of human beings at their worst and the capabilities of artificial intelligence systems at their best. This work proposes a set of principles for designing and building autonomous vehicles in a human-centered way that does not run away from the complexity of human nature but instead embraces it. We describe our development of the Human-Centered Autonomous Vehicle (HCAV) as an illustrative case study of implementing these principles in practice.

HCMay 8, 2018
Designing Toward Minimalism in Vehicle HMI

Julia Kindelsberger, Lex Fridman, Michael Glazer et al.

We propose that safe, beautiful, fulfilling vehicle HMI design must start from a rigorous consideration of minimalist design. Modern vehicles are changing from mechanical machines to mobile computing devices, similar to the change from landline phones to smartphones. We propose the approach of "designing toward minimalism", where we ask "why?" rather than "why not?" in choosing what information to display to the driver. We demonstrate this approach on an HMI case study of displaying vehicle speed. We first show that vehicle speed is what 87.6% of people ask for. We then show, through an online study with 1,038 subjects and 22,950 videos, that humans can estimate ego-vehicle speed very well, especially at lower speeds. Thus, despite believing that we need this information, we may not. In this way, we demonstrate a systematic approach of questioning the fundamental assumptions of what information is essential for vehicle HMI.

NEJan 9, 2018
DeepTraffic: Crowdsourced Hyperparameter Tuning of Deep Reinforcement Learning Systems for Multi-Agent Dense Traffic Navigation

Lex Fridman, Jack Terwilliger, Benedikt Jenik

We present a traffic simulation named DeepTraffic where the planning systems for a subset of the vehicles are handled by a neural network as part of a model-free, off-policy reinforcement learning process. The primary goal of DeepTraffic is to make the hands-on study of deep reinforcement learning accessible to thousands of students, educators, and researchers in order to inspire and fuel the exploration and evaluation of deep Q-learning network variants and hyperparameter configurations through large-scale, open competition. This paper investigates the crowd-sourced hyperparameter tuning of the policy network that resulted from the first iteration of the DeepTraffic competition where thousands of participants actively searched through the hyperparameter space.

CYNov 19, 2017
MIT Advanced Vehicle Technology Study: Large-Scale Naturalistic Driving Study of Driver Behavior and Interaction with Automation

Lex Fridman, Daniel E. Brown, Michael Glazer et al.

For the foreseeble future, human beings will likely remain an integral part of the driving task, monitoring the AI system as it performs anywhere from just over 0% to just under 100% of the driving. The governing objectives of the MIT Autonomous Vehicle Technology (MIT-AVT) study are to (1) undertake large-scale real-world driving data collection that includes high-definition video to fuel the development of deep learning based internal and external perception systems, (2) gain a holistic understanding of how human beings interact with vehicle automation technology by integrating video data with vehicle state data, driver characteristics, mental models, and self-reported experiences with technology, and (3) identify how technology and other factors related to automation adoption and use can be improved in ways that save lives. In pursuing these objectives, we have instrumented 23 Tesla Model S and Model X vehicles, 2 Volvo S90 vehicles, 2 Range Rover Evoque, and 2 Cadillac CT6 vehicles for both long-term (over a year per driver) and medium term (one month per driver) naturalistic driving data collection. Furthermore, we are continually developing new methods for analysis of the massive-scale dataset collected from the instrumented vehicle fleet. The recorded data streams include IMU, GPS, CAN messages, and high-definition video streams of the driver face, the driver cabin, the forward roadway, and the instrument cluster (on select vehicles). The study is on-going and growing. To date, we have 122 participants, 15,610 days of participation, 511,638 miles, and 7.1 billion video frames. This paper presents the design of the study, the data collection hardware, the processing of the data, and the computer vision algorithms currently being used to extract actionable knowledge from the data.

AIOct 12, 2017
Arguing Machines: Human Supervision of Black Box AI Systems That Make Life-Critical Decisions

Lex Fridman, Li Ding, Benedikt Jenik et al.

We consider the paradigm of a black box AI system that makes life-critical decisions. We propose an "arguing machines" framework that pairs the primary AI system with a secondary one that is independently trained to perform the same task. We show that disagreement between the two systems, without any knowledge of underlying system design or operation, is sufficient to arbitrarily improve the accuracy of the overall decision pipeline given human supervision over disagreements. We demonstrate this system in two applications: (1) an illustrative example of image classification and (2) on large-scale real-world semi-autonomous driving data. For the first application, we apply this framework to image classification achieving a reduction from 8.0% to 2.8% top-5 error on ImageNet. For the second application, we apply this framework to Tesla Autopilot and demonstrate the ability to predict 90.4% of system disengagements that were labeled by human annotators as challenging and needing human supervision.

HCJul 10, 2017
To Walk or Not to Walk: Crowdsourced Assessment of External Vehicle-to-Pedestrian Displays

Lex Fridman, Bruce Mehler, Lei Xia et al.

Researchers, technology reviewers, and governmental agencies have expressed concern that automation may necessitate the introduction of added displays to indicate vehicle intent in vehicle-to-pedestrian interactions. An automated online methodology for obtaining communication intent perceptions for 30 external vehicle-to-pedestrian display concepts was implemented and tested using Amazon Mechanic Turk. Data from 200 qualified participants was quickly obtained and processed. In addition to producing a useful early-stage evaluation of these specific design concepts, the test demonstrated that the methodology is scalable so that a large number of design elements or minor variations can be assessed through a series of runs even on much larger samples in a matter of hours. Using this approach, designers should be able to refine concepts both more quickly and in more depth than available development resources typically allow. Some concerns and questions about common assumptions related to the implementation of vehicle-to-pedestrian displays are posed.

NEJun 14, 2017
SideEye: A Generative Neural Network Based Simulator of Human Peripheral Vision

Lex Fridman, Benedikt Jenik, Shaiyan Keshvari et al.

Foveal vision makes up less than 1% of the visual field. The other 99% is peripheral vision. Precisely what human beings see in the periphery is both obvious and mysterious in that we see it with our own eyes but can't visualize what we see, except in controlled lab experiments. Degradation of information in the periphery is far more complex than what might be mimicked with a radial blur. Rather, behaviorally-validated models hypothesize that peripheral vision measures a large number of local texture statistics in pooling regions that overlap and grow with eccentricity. In this work, we develop a new method for peripheral vision simulation by training a generative neural network on a behaviorally-validated full-field synthesis model. By achieving a 21,000 fold reduction in running time, our approach is the first to combine realism and speed of peripheral vision simulation to a degree that provides a whole new way to approach visual design: through peripheral visualization.

CVDec 3, 2016
Semi-Automated Annotation of Discrete States in Large Video Datasets

Lex Fridman, Bryan Reimer

We propose a framework for semi-automated annotation of video frames where the video is of an object that at any point in time can be labeled as being in one of a finite number of discrete states. A Hidden Markov Model (HMM) is used to model (1) the behavior of the underlying object and (2) the noisy observation of its state through an image processing algorithm. The key insight of this approach is that the annotation of frame-by-frame video can be reduced from a problem of labeling every single image to a problem of detecting a transition between states of the underlying objected being recording on video. The performance of the framework is evaluated on a driver gaze classification dataset composed of 16,000,000 images that were fully annotated over 6,000 hours of direct manual annotation labor. On this dataset, we achieve a 13x reduction in manual annotation for an average accuracy of 99.1% and a 84x reduction for an average accuracy of 91.2%.

CVNov 26, 2016
What Can Be Predicted from Six Seconds of Driver Glances?

Lex Fridman, Heishiro Toyoda, Sean Seaman et al.

We consider a large dataset of real-world, on-road driving from a 100-car naturalistic study to explore the predictive power of driver glances and, specifically, to answer the following question: what can be predicted about the state of the driver and the state of the driving environment from a 6-second sequence of macro-glances? The context-based nature of such glances allows for application of supervised learning to the problem of vision-based gaze estimation, making it robust, accurate, and reliable in messy, real-world conditions. So, it's valuable to ask whether such macro-glances can be used to infer behavioral, environmental, and demographic variables? We analyze 27 binary classification problems based on these variables. The takeaway is that glance can be used as part of a multi-sensor real-time system to predict radio-tuning, fatigue state, failure to signal, talking, and several environment variables.

LGNov 22, 2015
Detecting Road Surface Wetness from Audio: A Deep Learning Approach

Irman Abdić, Lex Fridman, Erik Marchi et al.

We introduce a recurrent neural network architecture for automated road surface wetness detection from audio of tire-surface interaction. The robustness of our approach is evaluated on 785,826 bins of audio that span an extensive range of vehicle speeds, noises from the environment, road surface types, and pavement conditions including international roughness index (IRI) values from 25 in/mi to 1400 in/mi. The training and evaluation of the model are performed on different roads to minimize the impact of environmental and other external factors on the accuracy of the classification. We achieve an unweighted average recall (UAR) of 93.2% across all vehicle speeds including 0 mph. The classifier still works at 0 mph because the discriminating signal is present in the sound of other vehicles driving by.

LGNov 12, 2015
Learning Human Identity from Motion Patterns

Natalia Neverova, Christian Wolf, Griffin Lacey et al.

We present a large-scale study exploring the capability of temporal deep neural networks to interpret natural human kinematics and introduce the first method for active biometric authentication with mobile inertial sensors. At Google, we have created a first-of-its-kind dataset of human movements, passively collected by 1500 volunteers using their smartphones daily over several months. We (1) compare several neural architectures for efficient learning of temporal multi-modal data representations, (2) propose an optimized shift-invariant dense convolutional mechanism (DCWRNN), and (3) incorporate the discriminatively-trained dynamic features in a probabilistic generative framework taking into account temporal characteristics. Our results demonstrate that human kinematics convey important information about user identity and can serve as a valuable component of multi-modal authentication systems.

ROOct 21, 2015
Automated Synchronization of Driving Data Using Vibration and Steering Events

Lex Fridman, Daniel E Brown, William Angell et al.

We propose a method for automated synchronization of vehicle sensors useful for the study of multi-modal driver behavior and for the design of advanced driver assistance systems. Multi-sensor decision fusion relies on synchronized data streams in (1) the offline supervised learning context and (2) the online prediction context. In practice, such data streams are often out of sync due to the absence of a real-time clock, use of multiple recording devices, or improper thread scheduling and data buffer management. Cross-correlation of accelerometer, telemetry, audio, and dense optical flow from three video sensors is used to achieve an average synchronization error of 13 milliseconds. The insight underlying the effectiveness of the proposed approach is that the described sensors capture overlapping aspects of vehicle vibrations and vehicle steering allowing the cross-correlation function to serve as a way to compute the delay shift in each sensor. Furthermore, we show the decrease in synchronization error as a function of the duration of the data stream.

CVAug 17, 2015
Owl and Lizard: Patterns of Head Pose and Eye Pose in Driver Gaze Classification

Lex Fridman, Joonbum Lee, Bryan Reimer et al.

Accurate, robust, inexpensive gaze tracking in the car can help keep a driver safe by facilitating the more effective study of how to improve (1) vehicle interfaces and (2) the design of future Advanced Driver Assistance Systems. In this paper, we estimate head pose and eye pose from monocular video using methods developed extensively in prior work and ask two new interesting questions. First, how much better can we classify driver gaze using head and eye pose versus just using head pose? Second, are there individual-specific gaze strategies that strongly correlate with how much gaze classification improves with the addition of eye pose information? We answer these questions by evaluating data drawn from an on-road study of 40 drivers. The main insight of the paper is conveyed through the analogy of an "owl" and "lizard" which describes the degree to which the eyes and the head move when shifting gaze. When the head moves a lot ("owl"), not much classification improvement is attained by estimating eye pose on top of head pose. On the other hand, when the head stays still and only the eyes move ("lizard"), classification accuracy increases significantly from adding in eye pose. We characterize how that accuracy varies between people, gaze strategies, and gaze regions.

CVJul 16, 2015
Driver Gaze Region Estimation Without Using Eye Movement

Lex Fridman, Philipp Langhans, Joonbum Lee et al.

Automated estimation of the allocation of a driver's visual attention may be a critical component of future Advanced Driver Assistance Systems. In theory, vision-based tracking of the eye can provide a good estimate of gaze location. In practice, eye tracking from video is challenging because of sunglasses, eyeglass reflections, lighting conditions, occlusions, motion blur, and other factors. Estimation of head pose, on the other hand, is robust to many of these effects, but cannot provide as fine-grained of a resolution in localizing the gaze. However, for the purpose of keeping the driver safe, it is sufficient to partition gaze into regions. In this effort, we propose a system that extracts facial features and classifies their spatial configuration into six regions in real-time. Our proposed method achieves an average accuracy of 91.4% at an average decision rate of 11 Hz on a dataset of 50 drivers from an on-road study.

CRMar 29, 2015
Active Authentication on Mobile Devices via Stylometry, Application Usage, Web Browsing, and GPS Location

Lex Fridman, Steven Weber, Rachel Greenstadt et al.

Active authentication is the problem of continuously verifying the identity of a person based on behavioral aspects of their interaction with a computing device. In this study, we collect and analyze behavioral biometrics data from 200subjects, each using their personal Android mobile device for a period of at least 30 days. This dataset is novel in the context of active authentication due to its size, duration, number of modalities, and absence of restrictions on tracked activity. The geographical colocation of the subjects in the study is representative of a large closed-world environment such as an organization where the unauthorized user of a device is likely to be an insider threat: coming from within the organization. We consider four biometric modalities: (1) text entered via soft keyboard, (2) applications used, (3) websites visited, and (4) physical location of the device as determined from GPS (when outdoors) or WiFi (when indoors). We implement and test a classifier for each modality and organize the classifiers as a parallel binary decision fusion architecture. We are able to characterize the performance of the system with respect to intruder detection time and to quantify the contribution of each modality to the overall performance.