Structure Optimization for Deep Multimodal Fusion Networks using Graph-Induced Kernels
This work addresses the non-trivial engineering effort involved in designing fusion architectures for multimodal recognition. The contribution is incremental: it builds on existing Bayesian optimization methods and adds a new kernel.
The paper treats the optimization of fusion structures in deep multimodal networks as a discrete optimization problem under the Bayesian optimization framework. It proposes a novel graph-induced kernel to compute structural similarities between candidate architectures and demonstrates its effectiveness on two challenging human activity recognition datasets.
A popular testbed for deep learning has been multimodal recognition of human activity or gesture involving diverse inputs such as video, audio, skeletal pose and depth images. Deep learning architectures have excelled on such problems due to their ability to combine modality representations at different levels of nonlinear feature extraction. However, designing an optimal architecture in which to fuse such learned representations has largely been a non-trivial human engineering effort. We treat fusion structure optimization as a hyper-parameter search and cast it as a discrete optimization problem under the Bayesian optimization framework. We propose a novel graph-induced kernel to compute structural similarities in the search space of tree-structured multimodal architectures and demonstrate its effectiveness using two challenging multimodal human activity recognition datasets.
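To make the idea of a kernel induced by a graph over candidate structures more concrete, here is a minimal Python sketch. It is not the kernel proposed in the paper. The encoding of a structure as one fusion depth per modality, the neighbor rule, the exponential form, and the names `MODALITIES`, `MAX_DEPTH`, and `length_scale` are all assumptions made for illustration. Candidate structures are nodes of a graph, an edge joins two structures that differ by one small change, and similarity decays with shortest-path distance.

```python
from collections import deque
import math

# Hypothetical toy search space: each modality joins the shared trunk
# at one of MAX_DEPTH layers. A structure is a tuple of fusion depths.
MODALITIES = ("video", "audio", "pose")
MAX_DEPTH = 3


def neighbors(structure):
    """Yield structures that differ by moving one modality's fusion point by one layer."""
    for i, depth in enumerate(structure):
        for step in (-1, 1):
            new_depth = depth + step
            if 1 <= new_depth <= MAX_DEPTH:
                yield structure[:i] + (new_depth,) + structure[i + 1:]


def graph_distances(source):
    """Breadth-first search: shortest-path length from source to every reachable structure."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def graph_kernel(a, b, length_scale=1.0):
    """Similarity that decays exponentially with graph distance (an assumed form)."""
    return math.exp(-graph_distances(a)[b] / length_scale)


early = (1, 1, 1)  # all modalities fused at the first layer
mixed = (1, 1, 2)  # pose fused one layer later
late = (3, 3, 3)   # all modalities fused at the last layer

print(graph_kernel(early, mixed), graph_kernel(early, late))
```

In this toy grid the shortest-path distance coincides with the L1 distance between depth tuples, so the breadth-first search is only there to show the general recipe. For an arbitrary graph, a distance-based form like this one is not guaranteed to be positive semi-definite, so that property would need to be checked before using such a kernel as a Gaussian process covariance in Bayesian optimization.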