CLIP-RLDrive: Human-Aligned Autonomous Driving via CLIP-Based Reward Shaping in Reinforcement Learning
This work addresses the problem of reward design in autonomous driving for researchers and practitioners, though it appears incremental as it applies an existing method (CLIP) to a new domain.
The paper tackles the challenge of designing suitable reward models for reinforcement learning in autonomous driving. It introduces CLIP-RLDrive, which uses CLIP-based reward shaping to align vehicle decisions with human-like preferences, and reports improved performance in complex urban scenarios such as unsignalized intersections.
This paper presents CLIP-RLDrive, a new reinforcement learning (RL)-based framework for improving the decision-making of autonomous vehicles (AVs) in complex urban driving scenarios, particularly at unsignalized intersections. To achieve this goal, the decisions of AVs are aligned with human-like preferences through Contrastive Language-Image Pretraining (CLIP)-based reward shaping. One of the primary difficulties in an RL scheme is designing a suitable reward model, which is often hard to do manually because of the complexity of the interactions and the driving scenarios. To deal with this issue, this paper leverages Vision-Language Models (VLMs), particularly CLIP, to build an additional reward model based on visual and textual cues.
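As a rough illustration of the reward-shaping idea described above, the following Python sketch adds an image-text similarity bonus to an environment reward. It is not the paper's implementation: the function names `cosine_similarity` and `shaped_reward`, the `weight` parameter, and the additive form of the bonus are all assumptions made for illustration. The embeddings are assumed to come from CLIP's image and text encoders; plain vectors stand in for them here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def shaped_reward(env_reward, image_embedding, text_embedding, weight=0.5):
    """Hypothetical shaped reward: environment reward plus a weighted
    image-text similarity bonus. The additive form and the default
    weight are assumptions, not taken from the paper."""
    bonus = cosine_similarity(image_embedding, text_embedding)
    return env_reward + weight * bonus


# Stand-in embeddings; in practice these would be CLIP encoder outputs
# for a driving frame and a text description of the desired behavior.
frame_embedding = [1.0, 0.0]
prompt_embedding = [1.0, 0.0]
print(shaped_reward(0.0, frame_embedding, prompt_embedding))
```

Under these assumptions, a frame whose embedding matches the text description receives a larger reward than one that does not, which is how visual and textual cues could nudge the policy toward the described behavior.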