CV RO | Aug 19, 2023

Towards a High-Performance Object Detector: Insights from Drone Detection Using ViT and CNN-based Deep Learning Models

arXiv:2308.09899v21.510 citations | h-index: 4

Originality: Synthesis-oriented

AI Analysis

This work addresses drone detection for applications like collision avoidance and defense, but it is incremental as it compares existing ViT and CNN methods on a specific dataset.

The paper tackles drone detection using Vision Transformers (ViT) and CNNs on a dataset of 1359 drone photos. For single-drone detection, a basic ViT achieved performance 4.6 times more robust than the best CNN-based transfer learning model. For multi-drone detection, YOLO v7 (200 epochs) and YOLOS (20 epochs) reached 98% and 96% mAP, respectively.

Accurate drone detection is strongly desired in drone collision avoidance, drone defense and autonomous Unmanned Aerial Vehicle (UAV) self-landing. With the recent emergence of the Vision Transformer (ViT), this critical task is reassessed in this paper using a UAV dataset composed of 1359 drone photos. We construct various CNN and ViT-based models, demonstrating that for single-drone detection, a basic ViT can achieve performance 4.6 times more robust than our best CNN-based transfer learning models. By implementing the state-of-the-art You Only Look Once (YOLO v7, 200 epochs) and the experimental ViT-based You Only Look At One Sequence (YOLOS, 20 epochs) in multi-drone detection, we attain impressive 98% and 96% mAP values, respectively. We find that ViT outperforms CNN at the same epoch, but also requires more training data, computational power, and sophisticated, performance-oriented designs to fully surpass the capabilities of cutting-edge CNN detectors. We summarize the distinct characteristics of ViT and CNN models to aid future researchers in developing more efficient deep learning models.
