CV · Nov 26, 2018

Cross-domain Deep Feature Combination for Bird Species Classification with Audio-visual Data

arXiv:1811.10199v1 · 21 citations
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific problem in bird species classification by incrementally improving accuracy through multimodal data fusion.

The paper tackles bird species classification by combining visual and audio data using CNN-based multimodal fusion strategies. It shows that the combined model outperforms single-modality models and that transfer learning improves performance.

In the recent decade, many state-of-the-art algorithms for image classification as well as audio classification have achieved notable success with the development of deep convolutional neural networks (CNNs). However, most of these works exploit only a single type of training data. In this paper, we present a study on classifying bird species by exploiting the combination of both visual (image) and audio (sound) data using CNNs, which has been sparsely treated so far. Specifically, we propose CNN-based multimodal learning models with three types of fusion strategies (early, middle, late) to address the issues of combining training data across domains. The advantage of our proposed method lies in the fact that we can utilize CNNs not only to extract features from image and audio data (spectrograms) but also to combine the features across modalities. In the experiments, we train and evaluate the network structure on the comprehensive CUB-200-2011 standard data set, combined with our originally collected audio data set for the corresponding species. We observe that a model which utilizes the combination of both data types outperforms models trained with only one type of data. We also show that transfer learning can significantly increase the classification performance.
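The fusion strategies named in the abstract can be illustrated with a minimal sketch. It assumes the common reading of the terms, which the abstract itself does not spell out: middle fusion concatenates intermediate feature vectors from each modality, and late fusion blends per-modality class probabilities. The function names, the equal weighting, and the toy numbers are all hypothetical and are not taken from the paper; a real model would obtain the features and scores from trained CNNs.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def middle_fusion(image_features, audio_features):
    """Assumed middle fusion: concatenate per-modality feature vectors."""
    return list(image_features) + list(audio_features)


def late_fusion(image_scores, audio_scores, image_weight=0.5):
    """Assumed late fusion: weighted average of per-modality probabilities."""
    p_image = softmax(image_scores)
    p_audio = softmax(audio_scores)
    return [
        image_weight * pi + (1.0 - image_weight) * pa
        for pi, pa in zip(p_image, p_audio)
    ]


def predict(probabilities):
    """Return the index of the most probable class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Toy values for three hypothetical species (not real model outputs).
image_scores = [2.0, 1.0, 0.1]   # image branch favours class 0
audio_scores = [0.5, 2.5, 0.3]   # audio branch favours class 1

fused = late_fusion(image_scores, audio_scores)
joint = middle_fusion([0.2, 0.7], [0.9, 0.1, 0.4])

print("late-fusion probabilities:", fused)
print("late-fusion prediction:", predict(fused))
print("middle-fusion feature length:", len(joint))
```

In this toy case the two branches disagree, and the more confident audio branch decides the fused prediction. Early fusion, under the same assumed reading, would instead combine the image and the spectrogram before a single shared network processes them.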

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
