Structured Knowledge Distillation Towards Efficient and Compact Multi-View 3D Detection
This work addresses the efficiency of multi-view 3D object detection on edge devices and represents an incremental improvement achieved through distillation techniques.
The paper tackles the problem of inefficient multi-view 3D detection models by proposing a structured knowledge distillation framework, which improves average performance by 2.16 mAP and 2.27 NDS on the nuScenes benchmark.
Detecting 3D objects from multi-view images is a fundamental problem in 3D computer vision. Recently, significant breakthroughs have been made in multi-view 3D detection. However, the unprecedented detection performance of these vision-based BEV (bird's-eye-view) detection models comes with enormous parameter counts and computational cost, which makes them unaffordable on edge devices. To address this problem, we propose a structured knowledge distillation framework that aims to improve the efficiency of modern vision-only BEV detection models. The framework has three main components: (a) spatial-temporal distillation, which distills the teacher's knowledge of how to fuse information from different timestamps and views; (b) BEV response distillation, which distills the teacher's responses to different pillars; and (c) weight-inheriting, which solves the problem of inconsistent inputs between student and teacher in modern transformer architectures. Experimental results show that our method yields an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes benchmark, outperforming multiple baselines by a large margin.
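As a rough illustration of how the three components could fit together, the following minimal sketch combines a task loss with two distillation terms and copies compatible teacher weights into the student. The abstract does not specify the loss functions or the inheriting rule, so everything here is an assumption made for illustration: the mean-squared-error terms, the weights `w_st` and `w_bev`, the name-and-size matching rule, and all function names are hypothetical, and flattened Python lists stand in for real feature tensors.

```python
from typing import Dict, List, Sequence


def mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two equally sized, flattened feature maps."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher features must have the same size")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def distillation_loss(
    task_loss: float,
    student_fused: Sequence[float],
    teacher_fused: Sequence[float],
    student_bev: Sequence[float],
    teacher_bev: Sequence[float],
    w_st: float = 1.0,   # hypothetical weight for the spatial-temporal term
    w_bev: float = 1.0,  # hypothetical weight for the BEV response term
) -> float:
    """Task loss plus two distillation terms (MSE is an assumed choice)."""
    # (a) match the student's fused multi-view, multi-timestamp features
    st_term = mse(student_fused, teacher_fused)
    # (b) match the student's per-pillar BEV responses
    bev_term = mse(student_bev, teacher_bev)
    return task_loss + w_st * st_term + w_bev * bev_term


def inherit_weights(
    student: Dict[str, List[float]],
    teacher: Dict[str, List[float]],
) -> List[str]:
    """(c) Copy teacher parameters into the student where names and sizes match.

    This matching rule is an assumption; it returns the inherited names.
    """
    inherited = []
    for name, value in teacher.items():
        if name in student and len(student[name]) == len(value):
            student[name] = list(value)  # copy so the teacher stays untouched
            inherited.append(name)
    return inherited


if __name__ == "__main__":
    student_params = {"query_embed": [0.0, 0.0], "head": [0.0]}
    teacher_params = {"query_embed": [1.0, 2.0], "head": [1.0, 2.0]}
    print(inherit_weights(student_params, teacher_params))
    print(distillation_loss(0.5, [1.0, 2.0], [1.0, 4.0], [0.0], [1.0]))
```

In a real detector the same structure would typically operate on framework tensors, with the teacher frozen and only the student receiving gradients; the sketch only shows how the terms compose.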