RO LG · Apr 11, 2025

Offline Reinforcement Learning using Human-Aligned Reward Labeling for Autonomous Emergency Braking in Occluded Pedestrian Crossing

Vinal Asodia, Zhenhua Feng, Saber Fallah

arXiv:2504.08704v15.72 citations · h-index: 23

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of training autonomous driving systems with human-aligned safety considerations. The advance is incremental because it builds on existing offline RL methods.

The paper tackles the lack of reward labels in real-world driving datasets for offline reinforcement learning by developing a pipeline that generates human-aligned reward labels. The results show that these labels closely match simulation rewards and yield competitive policy performance in occluded pedestrian crossing scenarios.

Effective leveraging of real-world driving datasets is crucial for enhancing the training of autonomous driving systems. While Offline Reinforcement Learning enables the training of autonomous vehicles using such data, most available datasets lack meaningful reward labels. Reward labeling is essential as it provides feedback for the learning algorithm to distinguish between desirable and undesirable behaviors, thereby improving policy performance. This paper presents a novel pipeline for generating human-aligned reward labels. The proposed approach addresses the challenge of absent reward signals in real-world datasets by generating labels that reflect human judgment and safety considerations. The pipeline incorporates an adaptive safety component, activated by analyzing semantic segmentation maps, allowing the autonomous vehicle to prioritize safety over efficiency in potential collision scenarios. The proposed pipeline is applied to an occluded pedestrian crossing scenario with varying levels of pedestrian traffic, using synthetic and simulation data. The results indicate that the generated reward labels closely match the simulation reward labels. When used to train the driving policy using Behavior Proximal Policy Optimisation, the results are competitive with other baselines. This demonstrates the effectiveness of our method in producing reliable and human-aligned reward signals, facilitating the training of autonomous driving systems through Reinforcement Learning outside of simulation environments and in alignment with human values.
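
The abstract does not give the reward formula, so the following is only a minimal sketch of the general idea of a reward labeler whose safety term is activated by a semantic segmentation map. It is not the authors' pipeline. The class id, the pixel-fraction threshold, the target speed, and both reward terms are hypothetical values chosen for illustration.

```python
from typing import Sequence

# All constants below are hypothetical; the abstract does not specify them.
PEDESTRIAN_CLASS = 4    # assumed class id for "pedestrian" in the segmentation map
RISK_THRESHOLD = 0.02   # assumed fraction of pedestrian pixels that triggers safety mode
TARGET_SPEED = 10.0     # assumed desired cruising speed in m/s


def pedestrian_fraction(seg_map: Sequence[Sequence[int]]) -> float:
    """Return the fraction of pixels labeled as pedestrian."""
    total = sum(len(row) for row in seg_map)
    if total == 0:
        return 0.0
    hits = sum(1 for row in seg_map for cls in row if cls == PEDESTRIAN_CLASS)
    return hits / total


def label_reward(seg_map: Sequence[Sequence[int]], speed: float) -> float:
    """Assign a reward label to one logged (observation, speed) sample."""
    if pedestrian_fraction(seg_map) >= RISK_THRESHOLD:
        # Safety mode: any motion is penalized, so stopping scores highest.
        return -speed / TARGET_SPEED
    # Efficiency mode: reward peaks when driving at the target speed.
    return 1.0 - abs(speed - TARGET_SPEED) / TARGET_SPEED


if __name__ == "__main__":
    clear_road = [[0] * 10 for _ in range(10)]
    crossing = [row[:] for row in clear_road]
    for col in range(5):
        crossing[0][col] = PEDESTRIAN_CLASS  # 5 of 100 pixels are pedestrian
    print(label_reward(clear_road, 10.0))
    print(label_reward(crossing, 10.0))
```

In this sketch the switch between the two terms is a hard threshold. A labeler applied to an offline dataset would run `label_reward` over every logged transition before training a policy on the labeled data.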

