Robust Object Detection with Multi-input Multi-output Faster R-CNN
This work addresses robustness for object detection systems, but it is incremental, as it extends an existing MIMO approach to a new task.
The paper tackles robust object detection in out-of-distribution settings by applying a multi-input multi-output (MIMO) architecture to Faster R-CNN, achieving competitive accuracy with only two input/output pairs while adding 0.5% more parameters and increasing inference time by 15.9%.
Recent years have seen impressive progress in visual recognition on many benchmarks. However, generalization to the real world in out-of-distribution settings remains a significant challenge. A state-of-the-art method for robust visual recognition is model ensembling. However, it was recently shown that similarly competitive results can be achieved at a much lower cost by using a multi-input multi-output (MIMO) architecture. In this work, a generalization of the MIMO approach is applied to the task of object detection using the general-purpose Faster R-CNN model. It is shown that the MIMO framework allows building a strong feature representation and obtains very competitive accuracy when using just two input/output pairs. Furthermore, it adds just 0.5\% additional model parameters and increases the inference time by 15.9\% compared to the standard Faster R-CNN. It also performs comparably to, or outperforms, the Deep Ensemble approach in terms of model accuracy, robustness to out-of-distribution settings, and uncertainty calibration when the same number of predictions is used. This work opens up avenues for applying the MIMO approach to other high-level tasks such as semantic segmentation and depth estimation.
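To make the multi-input multi-output idea concrete, the following is a minimal sketch of the generic MIMO pattern, not the Faster R-CNN variant described here. It assumes the usual MIMO recipe: several inputs are combined and passed through one shared backbone, each output head predicts for its own input, and at test time the same input is repeated so that the head outputs can be averaged like ensemble members. The class `ToyMIMO`, its linear layers, the concatenation step, and all dimensions are hypothetical choices made for illustration only; the abstract does not specify how inputs are combined inside the detector.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyMIMO:
    """Hypothetical toy model: one shared backbone, one head per input."""

    def __init__(self, in_dim, hidden_dim, out_dim, num_members=2, seed=0):
        rng = random.Random(seed)
        self.num_members = num_members
        # Assumption: inputs are concatenated before the shared backbone.
        self.backbone = make_matrix(hidden_dim, in_dim * num_members, rng)
        # Each head is responsible for the prediction of one input.
        self.heads = [make_matrix(out_dim, hidden_dim, rng)
                      for _ in range(num_members)]

    def forward(self, inputs):
        """Training-style pass: num_members inputs in, num_members outputs out."""
        assert len(inputs) == self.num_members
        joint = [x for inp in inputs for x in inp]
        hidden = [max(0.0, h) for h in matvec(self.backbone, joint)]  # ReLU
        return [matvec(head, hidden) for head in self.heads]

    def predict(self, x):
        """Test-style pass: repeat one input and average the head outputs."""
        outputs = self.forward([x] * self.num_members)
        return [sum(vals) / self.num_members for vals in zip(*outputs)]


model = ToyMIMO(in_dim=4, hidden_dim=8, out_dim=3, num_members=2)
train_outputs = model.forward([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]])
ensemble_prediction = model.predict([1.0, 2.0, 3.0, 4.0])
```

In this toy, the only parts that grow with the number of input/output pairs are the first layer and the heads, which gives some intuition for why the parameter overhead of a MIMO model can stay small relative to a full ensemble of separate networks.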