Multi-View Networks For Multi-Channel Audio Classification
The work addresses the challenge of deploying multi-channel audio models in real-world settings where the sensor setup varies, though the contribution appears incremental because it builds on existing multi-view or multi-channel methods.
The paper tackles sound classification with multiple sensors by introducing multi-view networks that handle an arbitrary, dynamically changing number of input channels without performance degradation. It demonstrates generalization to unseen channel counts and unseen room geometries, with evaluation in both anechoic and simulated room environments.
In this paper we introduce the idea of multi-view networks for sound classification with multiple sensors. We show how one can build a multi-channel sound recognition model trained on a fixed number of channels and deploy it in scenarios with an arbitrary (and potentially dynamically changing) number of input channels without observing degradation in performance. We demonstrate that at inference time one can safely provide this model with all available channels, as it can ignore noisy information and leverage new information better than standard baseline approaches. The model is evaluated in both an anechoic environment and in rooms generated by a room acoustics simulator. We demonstrate that this model can generalize to unseen numbers of channels as well as unseen room geometries.
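The abstract does not spell out the architecture, so the following is only a minimal sketch of one generic way a classifier can accept a variable number of channels: apply the same weights to every channel, then pool over the channel axis. This shared-weights-plus-mean-pooling design is an assumption made for illustration and is not presented as the multi-view network itself; the names `encode_channel`, `classify`, `enc_w`, and `out_w`, and all dimensions, are hypothetical.

```python
import math
import random


def encode_channel(frame, weights):
    """Apply one shared linear map followed by tanh to a single channel's features."""
    return [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]


def classify(channels, enc_weights, out_weights):
    """Return class scores for any number of input channels.

    channels: list of per-channel feature vectors; its length may differ per call.
    """
    if not channels:
        raise ValueError("need at least one channel")
    # The same encoder weights are reused for every channel (assumption).
    encoded = [encode_channel(ch, enc_weights) for ch in channels]
    # Mean-pool over channels so the pooled size is independent of channel count.
    hidden = len(encoded[0])
    pooled = [sum(e[i] for e in encoded) / len(encoded) for i in range(hidden)]
    return [sum(w * h for w, h in zip(row, pooled)) for row in out_weights]


# Hypothetical sizes and random weights, standing in for a trained model.
rng = random.Random(0)
FEAT, HIDDEN, CLASSES = 8, 4, 3
enc_w = [[rng.uniform(-1, 1) for _ in range(FEAT)] for _ in range(HIDDEN)]
out_w = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(CLASSES)]

# Two channels, then the same two plus three additional channels.
two_ch = [[rng.gauss(0, 1) for _ in range(FEAT)] for _ in range(2)]
five_ch = two_ch + [[rng.gauss(0, 1) for _ in range(FEAT)] for _ in range(3)]

scores_2 = classify(two_ch, enc_w, out_w)
scores_5 = classify(five_ch, enc_w, out_w)
```

Both calls return one score per class regardless of how many channels were supplied, and reordering the channels leaves the scores unchanged up to floating-point rounding. Plain mean pooling weights every channel equally, so it does not by itself reproduce the ability to ignore noisy channels that the abstract claims for the proposed model.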