CV · AI · May 11

Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving

arXiv:2605.10117 · 14.9

Predicted impact: top 93% in CV · last 90 days · Originality: Highly original

AI Analysis

For autonomous driving, this work addresses computational inefficiency and occlusion tracking, achieving strong gains in latency and long-tail performance.

Enhanced HOPE adapts computation per LiDAR frame based on geometric complexity, reducing latency by 38% on simple scenes with no accuracy loss, improving mAP by 2.7 points on rare scenarios, and tracking objects through occlusions over 5 seconds where baselines fail.
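The per-frame routing idea can be illustrated with a minimal sketch. The summary does not say which statistical estimator Enhanced HOPE uses, so this example swaps in a simple stand-in: the Shannon entropy of a bird's-eye-view occupancy histogram. The function names, the 2 m cell size, and the 3-bit threshold are all hypothetical choices made for illustration, not values from the paper.

```python
import math
from collections import Counter


def occupancy_entropy(points, cell=2.0):
    """Shannon entropy (in bits) of how LiDAR points spread over BEV grid cells.

    Stand-in complexity measure (an assumption, not the paper's estimator):
    points concentrated in few cells give low entropy, points spread over
    many cells give high entropy.
    """
    counts = Counter(
        (math.floor(x / cell), math.floor(y / cell)) for x, y, _z in points
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def route(points, threshold=3.0, cell=2.0):
    """Pick a processing path for one frame; the threshold is hand-picked here."""
    return "deep" if occupancy_entropy(points, cell) > threshold else "shallow"


# Hypothetical frames: all returns in one cell vs. returns spread over a 4x4 grid.
empty_road = [(0.5, 0.5, 0.0)] * 10
busy_intersection = [
    (i * 2.0 + 0.5, j * 2.0 + 0.5, 0.0) for i in range(4) for j in range(4)
]
print(route(empty_road), route(busy_intersection))
```

A fixed threshold is the simplest possible gate; a label-free alternative would be to set it from running statistics of recent frames, which is closer in spirit to the unsupervised routing the summary describes.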

Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.
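To make the persistent-memory idea concrete, here is a minimal sketch of a track store that keeps objects alive while they are occluded. It is not the paper's module: the abstract gives no implementation details, so the `Track` and `TemporalMemory` names, the constant-velocity extrapolation, the 6-second retention window, and the assumption that detections already carry stable track IDs are all illustrative. The traffic-rule memory mentioned in the abstract is left out.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    position: tuple   # (x, y) in metres at last_seen
    velocity: tuple   # (vx, vy) in metres per second
    last_seen: float  # timestamp in seconds


class TemporalMemory:
    """Keeps tracks across frames and forgets them only after max_age seconds."""

    def __init__(self, max_age=6.0):
        self.max_age = max_age
        self.tracks = {}

    def update(self, detections, now):
        """detections maps track_id -> (position, velocity) for visible objects."""
        for tid, (pos, vel) in detections.items():
            self.tracks[tid] = Track(tid, pos, vel, now)
        # Occluded objects stay in memory until they exceed the retention window.
        self.tracks = {
            tid: t
            for tid, t in self.tracks.items()
            if now - t.last_seen <= self.max_age
        }

    def recall(self, now):
        """Predicted position of every remembered track (constant velocity)."""
        predicted = {}
        for tid, t in self.tracks.items():
            dt = now - t.last_seen
            predicted[tid] = (
                t.position[0] + t.velocity[0] * dt,
                t.position[1] + t.velocity[1] * dt,
            )
        return predicted


memory = TemporalMemory(max_age=6.0)
memory.update({1: ((0.0, 0.0), (2.0, 0.0))}, now=0.0)  # object 1 is visible
memory.update({}, now=5.0)                             # occluded for 5 seconds
after_five_seconds = memory.recall(now=5.0)            # still remembered
memory.update({}, now=7.0)                             # past the retention window
after_seven_seconds = memory.recall(now=7.0)
```

A frame-by-frame detector corresponds to `max_age=0`, which drops the object as soon as it is hidden; the retention window is what lets the track survive a multi-second occlusion.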

