Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications
This work addresses object detection in cluttered transportation scenes, offering incremental improvements through query adaptation and attention mechanisms.
The paper tackles occlusions, fine-grained localization, and computational inefficiency in transformer-based object detectors for transportation applications. It proposes DAMM, a framework that combines multi-modal queries with dual-stream attention, and reports state-of-the-art average precision and recall on four benchmarks.
Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: \href{https://github.com/DET-LIP/DAMM}{GitHub}.
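To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It concatenates three query groups (appearance, positional, learned) and refines them with two separate cross-attention streams whose outputs are fused. Everything specific here is an assumption made for illustration: the function names, the linear projection of flattened polygon vertices (`w_poly`), the use of a positional encoding in the spatial stream, the concatenate-and-project fusion (`w_fuse`), and the random stand-ins for vision-language embeddings and image features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))  # (num_queries, num_tokens)
    return weights @ values, weights


def build_queries(appearance, polygons, learned, w_poly):
    """Concatenate three query groups into one (num_queries, d) matrix.

    appearance: (n_a, d) embeddings, assumed to come from a vision-language model.
    polygons:   (n_p, n_vertices, 2) vertex coordinates, embedded here by a
                hypothetical linear projection of the flattened vertices.
    learned:    (n_l, d) free parameters for general scene coverage.
    """
    positional = polygons.reshape(len(polygons), -1) @ w_poly
    return np.concatenate([appearance, positional, learned], axis=0)


def dual_stream_attention(queries, feats, pos_enc, w_fuse):
    """Refine queries with separate semantic and spatial streams (assumed design)."""
    # Semantic stream: attend over content features only.
    semantic, _ = cross_attention(queries, feats, feats)
    # Spatial stream: attend over position-augmented features.
    spatial_tokens = feats + pos_enc
    spatial, _ = cross_attention(queries, spatial_tokens, spatial_tokens)
    # Fuse the two streams and add a residual connection.
    fused = np.concatenate([semantic, spatial], axis=-1) @ w_fuse
    return queries + fused


# Toy inputs; random values stand in for real embeddings and image features.
rng = np.random.default_rng(0)
d, num_tokens = 16, 50
appearance = rng.normal(size=(3, d))
polygons = rng.uniform(size=(2, 4, 2))   # two quadrilaterals
learned = rng.normal(size=(5, d))
w_poly = rng.normal(size=(8, d))         # 4 vertices * 2 coordinates -> d

queries = build_queries(appearance, polygons, learned, w_poly)
feats = rng.normal(size=(num_tokens, d))
pos_enc = rng.normal(size=(num_tokens, d))
w_fuse = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)

refined = dual_stream_attention(queries, feats, pos_enc, w_fuse)
print(queries.shape, refined.shape)
```

Keeping the streams separate means each one has its own attention map, so a query can weight tokens by content in one stream and by location in the other before the results are merged. A real detector would use multi-head attention, learned projections, and stacked decoder layers; those are omitted to keep the sketch short.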