Kevin Yu

RO
h-index53
20papers
26,782citations
Novelty41%
AI Score46

20 Papers

AIDec 21, 2024
OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer et al. · openai

The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

CLMar 15, 2023
GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal et al. · berkeley, deepmind

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.

CLDec 19, 2025
OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman et al. · berkeley, mila

This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

AIJul 8, 2024
Double-Ended Synthesis Planning with Goal-Constrained Bidirectional Search

Kevin Yu, Jihye Roh, Ziang Li et al.

Computer-aided synthesis planning (CASP) algorithms have demonstrated expert-level abilities in planning retrosynthetic routes to molecules of low to moderate complexity. However, current search methods assume the sufficiency of reaching arbitrary building blocks, failing to address the common real-world constraint where using specific molecules is desired. To this end, we present a formulation of synthesis planning with starting material constraints. Under this formulation, we propose Double-Ended Synthesis Planning (DESP), a novel CASP algorithm under a bidirectional graph search scheme that interleaves expansions from the target and from the goal starting materials to ensure constraint satisfiability. The search algorithm is guided by a goal-conditioned cost network learned offline from a partially observed hypergraph of valid chemical reactions. We demonstrate the utility of DESP in improving solve rates and reducing the number of search expansions by biasing synthesis planning towards expert goals on multiple new benchmarks. DESP can make use of existing one-step retrosynthesis models, and we anticipate its performance to scale as these one-step model capabilities improve.

LGOct 30, 2025
Remote Labor Index: Measuring AI Automation of Remote Work

Mantas Mazeika, Alice Gatti, Cristina Menghini et al.

AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.

CLOct 24, 2024Code
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Shuhao Gu, Jialing Zhang, Siyuan Zhou et al.

Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.

AIJan 3, 2025Code
ASKCOS: an open source software suite for synthesis planning

Zhengkai Tu, Sourabh J. Choure, Mun Hong Fong et al.

The advancement of machine learning and the availability of large-scale reaction datasets have accelerated the development of data-driven models for computer-aided synthesis planning (CASP) in the past decade. Here, we detail the newest version of ASKCOS, an open source software suite for synthesis planning that makes available several research advances in a freely available, practical tool. Four one-step retrosynthesis models form the basis of both interactive planning and automatic planning modes. Retrosynthetic planning is complemented by other modules for feasibility assessment and pathway evaluation, including reaction condition recommendation, reaction outcome prediction, and auxiliary capabilities such as solubility prediction and quantum mechanical descriptor prediction. ASKCOS has assisted hundreds of medicinal, synthetic, and process chemists in their day-to-day tasks, complementing expert decision making. It is our belief that CASP tools like ASKCOS are an important part of modern chemistry research, and that they offer ever-increasing utility and accessibility.

ROOct 30, 2019Code
Crop Height and Plot Estimation for Phenotyping from Unmanned Aerial Vehicles using 3D LiDAR

Harnaik Dhami, Kevin Yu, Tianshu Xu et al.

We present techniques to measure crop heights using a 3D Light Detection and Ranging (LiDAR) sensor mounted on an Unmanned Aerial Vehicle (UAV). Knowing the height of plants is crucial to monitor their overall health and growth cycles, especially for high-throughput plant phenotyping. We present a methodology for extracting plant heights from 3D LiDAR point clouds, specifically focusing on plot-based phenotyping environments. We also present a toolchain that can be used to create phenotyping farms for use in Gazebo simulations. The tool creates a randomized farm with realistic 3D plant and terrain models. We conducted a series of simulations and hardware experiments in controlled and natural settings. Our algorithm was able to estimate the plant heights in a field with 112 plots with a root mean square error (RMSE) of 6.1 cm. This is the first such dataset for 3D LiDAR from an airborne robot over a wheat field. The developed simulation toolchain, algorithmic implementation, and datasets can be found on the GitHub repository located at https://github.com/hsd1121/PointCloudProcessing.

LGApr 9, 2025
Resource-efficient Inference with Foundation Model Programs

Lunyiu Nie, Zhimin Ding, Kevin Yu et al.

The inference-time resource costs of large language and vision models present a growing challenge in production deployments. We propose the use of foundation model programs, i.e., programs that can invoke foundation models with varying resource costs and performance, as an approach to this problem. Specifically, we present a method that translates a task into a program, then learns a policy for resource allocation that, on each input, selects foundation model "backends" for each program module. The policy uses smaller, cheaper backends to handle simpler subtasks, while allowing more complex subtasks to leverage larger, more capable models. We evaluate the method on two new "streaming" visual question-answering tasks in which a system answers a question on a sequence of inputs, receiving ground-truth feedback after each answer. Compared to monolithic multi-modal models, our implementation achieves up to 98% resource savings with minimal accuracy loss, demonstrating its potential for scalable and resource-efficient multi-modal inference.

RODec 9, 2020
GATSBI: An Online GTSP-Based Algorithm for Targeted Surface Bridge Inspection

Harnaik Dhami, Kevin Yu, Troi Williams et al.

We study the problem of visual surface inspection of a bridge for defects using an Unmanned Aerial Vehicle (UAV). We do not assume that the geometric model of the bridge is known beforehand. Our planner, termed GATSBI, plans a path in a receding horizon fashion to inspect all points on the surface of the bridge. The input to GATSBI consists of a 3D occupancy map created online with LiDAR scans. Occupied voxels corresponding to the bridge in this map are semantically segmented and used to create a bridge-only occupancy map. Inspecting a bridge voxel requires the UAV to take images from a desired viewing angle and distance. We then create a Generalized Traveling Salesperson Problem (GTSP) instance to cluster candidate viewpoints for inspecting the bridge voxels and use an off-the-shelf GTSP solver to find the optimal path for the given instance. As the algorithm sees more parts of the environment over time, it replans the path to inspect novel parts of the bridge while avoiding obstacles. We evaluate the performance of our algorithm through high-fidelity simulations conducted in AirSim and real-world experiments. We compare the performance of GATSBI with a classical exploration algorithm. Our evaluation reveals that targeting the inspection to only the segmented bridge voxels and planning carefully using a GTSP solver leads to a more efficient and thorough inspection than the baseline algorithm.

ROJul 7, 2020
Coverage of an Environment Using Energy-Constrained Unmanned Aerial Vehicles

Kevin Yu, Jason M. O'Kane, Pratap Tokekar

We study the problem of covering an environment using an Unmanned Aerial Vehicle (UAV) with limited battery capacity. We consider a scenario where the UAV can land on an Unmanned Ground Vehicle (UGV) and recharge the onboard battery. The UGV can also recharge the UAV while transporting the UAV to the next take-off site. We present an algorithm to solve a new variant of the area coverage problem that takes into account this symbiotic UAV and UGV system. The input consists of a set of boustrophedon cells -- rectangular strips whose width is equal to the field-of-view of the sensor on the UAV. The goal is to find a coordinated strategy for the UAV and UGV that visits and covers all cells in minimum time, while optimally finding how much to recharge, where to recharge, and when to recharge the battery. This includes flight time for visiting and covering all cells, recharging time, as well as the take-off and landing times. We show how to reduce this problem to a known NP-hard problem, Generalized Traveling Salesperson Problem (GTSP). Given an optimal GTSP solver, our approach finds the optimal coverage paths for the UAV and UGV. Our formulation models multi-rotor UAVs as well as hybrid UAVs that can operate in fixed-wing and Vertical Take-off and Landing modes. We evaluate our algorithm through simulations and proof-of-concept experiments.

HCJul 6, 2020
Augment Yourself: Mixed Reality Self-Augmentation Using Optical See-through Head-mounted Displays and Physical Mirrors

Mathias Unberath, Kevin Yu, Roghayeh Barmaki et al.

Optical see-though head-mounted displays (OST HMDs) are one of the key technologies for merging virtual objects and physical scenes to provide an immersive mixed reality (MR) environment to its user. A fundamental limitation of HMDs is, that the user itself cannot be augmented conveniently as, in casual posture, only the distal upper extremities are within the field of view of the HMD. Consequently, most MR applications that are centered around the user, such as virtual dressing rooms or learning of body movements, cannot be realized with HMDs. In this paper, we propose a novel concept and prototype system that combines OST HMDs and physical mirrors to enable self-augmentation and provide an immersive MR environment centered around the user. Our system, to the best of our knowledge the first of its kind, estimates the user's pose in the virtual image generated by the mirror using an RGBD camera attached to the HMD and anchors virtual objects to the reflection rather than the user directly. We evaluate our system quantitatively with respect to calibration accuracy and infrared signal degradation effects due to the mirror, and show its potential in applications where large mirrors are already an integral part of the facility. Particularly, we demonstrate its use for virtual fitting rooms, gaming applications, anatomy learning, and personal fitness. In contrast to competing devices such as LCD-equipped smart mirrors, the proposed system consists of only an HMD with RGBD camera and, thus, does not require a prepared environment making it very flexible and generic. In future work, we will aim to investigate how the system can be optimally used for physical rehabilitation and personal training as a promising application.

ROApr 15, 2020
Combining Geometric and Information-Theoretic Approaches for Multi-Robot Exploration

Aravind Preshant Premkumar, Kevin Yu, Pratap Tokekar

We present an algorithm to explore an orthogonal polygon using a team of $p$ robots. This algorithm combines ideas from information-theoretic exploration algorithms and computational geometry based exploration algorithms. We show that the exploration time of our algorithm is competitive (as a function of $p$) with respect to the offline optimal exploration algorithm. The algorithm is based on a single-robot polygon exploration algorithm, a tree exploration algorithm for higher level planning and a submodular orienteering algorithm for lower level planning. We discuss how this strategy can be adapted to real-world settings to deal with noisy sensors. In addition to theoretical analysis, we investigate the performance of our algorithm through simulations for multiple robots and experiments with a single robot.

CYDec 17, 2019
Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?

R. Stuart Geiger, Kevin Yu, Yanlai Yang et al.

Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from ArXiv and traditional publications performing an ML classification task on Twitter data --- give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and if the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally-important aspect of whether such data is reliable in the first place.

ROOct 3, 2019
View Planning and Navigation Algorithms for Autonomous Bridge Inspection with UAVs

Kevin Yu, Prajwal Shanthakumar, Jonah Orevillo et al.

We study the problem of infrastructure inspection using an Unmanned Aerial Vehicle (UAV) in box girder bridge environments. We consider a scenario where the UAV needs to fully inspect box girder bridges and localize along the bridge surface when standard methods like GPS and optical flow are denied. Our method for overcoming the difficulties of box girder bridges consist of creating local navigation routines, a supervisor, and a planner. The local navigation routines use two 2D Lidars for girder and column flight. For switching between local navigation routines we implement a supervisor which dictates when the UAV is able to switch between local navigation routines. Lastly, we implement a planner to calculate the path along that box girder bridge that will minimize the flight time of the UAV. With local navigation routines, a supervisor, and a planner we construct a system that can fully and autonomously inspect box girder bridges when standard methods are unavailable.

IVApr 25, 2019
Improved visible to IR image transformation using synthetic data augmentation with cycle-consistent adversarial networks

Kyongsik Yun, Kevin Yu, Joseph Osborne et al.

Infrared (IR) images are essential to improve the visibility of dark or camouflaged objects. Object recognition and segmentation based on a neural network using IR images provide more accuracy and insight than color visible images. But the bottleneck is the amount of relevant IR images for training. It is difficult to collect real-world IR images for special purposes, including space exploration, military and fire-fighting applications. To solve this problem, we created color visible and IR images using a Unity-based 3D game editor. These synthetically generated color visible and IR images were used to train cycle consistent adversarial networks (CycleGAN) to convert visible images to IR images. CycleGAN has the advantage that it does not require precisely matching visible and IR pairs for transformation training. In this study, we discovered that additional synthetic data can help improve CycleGAN performance. Neural network training using real data (N = 20) performed more accurate transformations than training using real (N = 10) and synthetic (N = 10) data combinations. The result indicates that the synthetic data cannot exceed the quality of the real data. Neural network training using real (N = 10) and synthetic (N = 100) data combinations showed almost the same performance as training using real data (N = 20). At least 10 times more synthetic data than real data is required to achieve the same performance. In summary, CycleGAN is used with synthetic data to improve the IR image conversion performance of visible images.

HCApr 17, 2019
Collaboration Analysis Using Deep Learning

Zhang Guo, Kevin Yu, Rebecca Pearlman et al.

The analysis of the collaborative learning process is one of the growing fields of education research, which has many different analytic solutions. In this paper, we provided a new solution to improve automated collaborative learning analyses using deep neural networks. Instead of using self-reported questionnaires, which are subject to bias and noise, we automatically extract group-working information by object recognition results using Mask R-CNN method. This process is based on detecting the people and other objects from pictures and video clips of the collaborative learning process, then evaluate the mobile learning performance using the collaborative indicators. We tested our approach to automatically evaluate the group-work collaboration in a controlled study of thirty-three dyads while performing an anatomy body painting intervention. The results indicate that our approach recognizes the differences of collaborations among teams of treatment and control groups in the case study. This work introduces new methods for automated quality prediction of collaborations among human-human interactions using computer vision techniques.

CVJan 4, 2018
On-the-fly Augmented Reality for Orthopaedic Surgery Using a Multi-Modal Fiducial

Sebastian Andress, Alex Johnson, Mathias Unberath et al.

Fluoroscopic X-ray guidance is a cornerstone for percutaneous orthopaedic surgical procedures. However, two-dimensional observations of the three-dimensional anatomy suffer from the effects of projective simplification. Consequently, many X-ray images from various orientations need to be acquired for the surgeon to accurately assess the spatial relations between the patient's anatomy and the surgical tools. In this paper, we present an on-the-fly surgical support system that provides guidance using augmented reality and can be used in quasi-unprepared operating rooms. The proposed system builds upon a multi-modality marker and simultaneous localization and mapping technique to co-calibrate an optical see-through head mounted display to a C-arm fluoroscopy system. Then, annotations on the 2D X-ray images can be rendered as virtual objects in 3D providing surgical guidance. We quantitatively evaluate the components of the proposed system, and finally, design a feasibility study on a semi-anthropomorphic phantom. The accuracy of our system was comparable to the traditional image-guided technique while substantially reducing the number of acquired X-ray images as well as procedure time. Our promising results encourage further research on the interaction between virtual and real objects, that we believe will directly benefit the proposed method. Further, we would like to explore the capabilities of our on-the-fly augmented reality support system in a larger study directed towards common orthopaedic interventions.

OHOct 2, 2017
Technical Note: Towards Virtual Monitors for Image Guided Interventions - Real-time Streaming to Optical See-Through Head-Mounted Displays

Long Qian, Mathias Unberath, Kevin Yu et al.

Purpose: Image guidance is crucial for the success of many interventions. Images are displayed on designated monitors that cannot be positioned optimally due to sterility and spatial constraints. This indirect visualization causes potential occlusion, hinders hand-eye coordination, leads to increased procedure duration and surgeon load. Methods: We propose a virtual monitor system that displays medical images in a mixed reality visualization using optical see-through head-mounted displays. The system streams high-resolution medical images from any modality to the head-mounted display in real-time that are blended with the surgical site. It allows for mixed reality visualization of images in head-, world-, or body-anchored mode and can thus be adapted to specific procedural needs. Results: For typical image sizes, the proposed system exhibits an average end-to-end delay and refresh rate of 214 +- 30 ms and 41:4 +- 32:0 Hz, respectively. Conclusions: The proposed virtual monitor system is capable of real-time mixed reality visualization of medical images. In future, we seek to conduct first pre-clinical studies to quantitatively assess the impact of the system on standard image guided procedures.

ROMar 31, 2017
Algorithms for Routing of Unmanned Aerial Vehicles with Mobile Recharging Stations

Kevin Yu, Ashish Kumar Budhiraja, Pratap Tokekar

We study the problem of planning a tour for an energy-limited Unmanned Aerial Vehicle (UAV) to visit a set of sites in the least amount of time. We envision scenarios where the UAV can be recharged along the way either by landing on stationary recharging stations or on Unmanned Ground Vehicles (UGVs) acting as mobile recharging stations. This leads to a new variant of the Traveling Salesperson Problem (TSP) with mobile recharging stations. We present an algorithm that finds not only the order in which to visit the sites but also when and where to land on the charging stations to recharge. Our algorithm plans tours for the UGVs as well as determines best locations to place stationary charging stations. While the problems we study are NP-Hard, we present a practical solution using Generalized TSP that finds the optimal solution. If the UGVs are slower, the algorithm also finds the minimum number of UGVs required to support the UAV mission such that the UAV is not required to wait for the UGV. Our simulation results show that the running time is acceptable for reasonably sized instances in practice.