CVNov 25, 2025Code
WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous DrivingSeungjun Yu, Seonho Lee, Namho Kim et al.
Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents. Our code and data are provided in https://github.com/sjyu001/WaymoQA
CVJun 13, 2024Code
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QAJongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya et al.
Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature leverage large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is not efficient and can mostly be redundant. Motivated by this inefficiency, we propose LVNet, a modular and training-free framework featuring a novel Hierarchical Keyframe Selector (HKS) that efficiently selects a minimal set of informative frames tailored to each question. LVNet's modularity allows easy integration with existing approaches for more efficient LVQA. We achieve state-of-the-art performance among similarly configured models across four benchmark LVQA datasets: EgoSchema, NExT-QA, IntentQA, VideoMME. The code can be found at https://github.com/jongwoopark7978/LVNet
RONov 28, 2025
SUPER-AD: Semantic Uncertainty-aware Planning for End-to-End Robust Autonomous DrivingWonjeong Ryu, Seungjun Yu, Seokha Moon et al.
End-to-End (E2E) planning has become a powerful paradigm for autonomous driving, yet current systems remain fundamentally uncertainty-blind. They assume perception outputs are fully reliable, even in ambiguous or poorly observed scenes, leaving the planner without an explicit measure of uncertainty. To address this limitation, we propose a camera-only E2E framework that estimates aleatoric uncertainty directly in BEV space and incorporates it into planning. Our method produces a dense, uncertainty-aware drivability map that captures both semantic structure and geometric layout at pixel-level resolution. To further promote safe and rule-compliant behavior, we introduce a lane-following regularization that encodes lane structure and traffic norms. This prior stabilizes trajectory planning under normal conditions while preserving the flexibility needed for maneuvers such as overtaking or lane changes. Together, these components enable robust and interpretable trajectory planning, even under challenging uncertainty conditions. Evaluated on the NAVSIM benchmark, our method achieves state-of-the-art performance, delivering substantial gains on both the challenging NAVHARD and NAVSAFE subsets. These results demonstrate that our principled aleatoric uncertainty modeling combined with driving priors significantly advances the safety and reliability of camera-only E2E autonomous driving.
CVNov 27, 2025
Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy OptimizationInha Kang, Eunki Kim, Wonjeong Ryu et al.
Accurate long horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non-real-time inputs, limiting their practical utility for localized warning systems. To address this gap, we construct and release the real-world observations and high-resolution CMAQ-OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts. However, standard point-wise objectives cannot reflect asymmetric operational costs, where false alarms deteriorate public trust while missed severe events endanger populations. This cost mismatch causes SFT models to over-predict and yield high False Alarm Rates. We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. Experimental results demonstrate that our framework significantly improves the reliability of the forecast. Compared to the SFT-only baseline, our model reduces the False Alarm Rate by 47.3% while achieving a competitive F1-score, proving its effectiveness for practical, real-world air quality forecasting systems on long lead time scenarios.