SD AI ASAug 25, 2022

Spatio-Temporal Representation Learning Enhanced Source Cell-phone Recognition from Speech Recordings

Chunyan Zeng, Shixiong Feng, Zhifeng Wang, Xiangkui Wan, Yunfan Chen, Nan Zhao

arXiv:2208.12753v18.314 citationsh-index: 21

Originality Incremental advance

AI Analysis

This addresses the need for accurate device identification in forensic or security applications, but it appears incremental as it builds on existing recognition methods with specific improvements.

The paper tackles the problem of source cell-phone recognition from speech recordings by proposing a spatio-temporal representation learning method, achieving an average accuracy of 99.03% for closed-set recognition of 45 cell-phones and 98.18% in small sample size experiments, outperforming existing state-of-the-art methods.

The existing source cell-phone recognition method lacks the long-term feature characterization of the source device, resulting in inaccurate representation of the source cell-phone related features which leads to insufficient recognition accuracy. In this paper, we propose a source cell-phone recognition method based on spatio-temporal representation learning, which includes two main parts: extraction of sequential Gaussian mean matrix features and construction of a recognition model based on spatio-temporal representation learning. In the feature extraction part, based on the analysis of time-series representation of recording source signals, we extract sequential Gaussian mean matrix with long-term and short-term representation ability by using the sensitivity of Gaussian mixture model to data distribution. In the model construction part, we design a structured spatio-temporal representation learning network C3D-BiLSTM to fully characterize the spatio-temporal information, combine 3D convolutional network and bidirectional long short-term memory network for short-term spectral information and long-time fluctuation information representation learning, and achieve accurate recognition of cell-phones by fusing spatio-temporal feature information of recording source signals. The method achieves an average accuracy of 99.03% for the closed-set recognition of 45 cell-phones under the CCNU\_Mobile dataset, and 98.18% in small sample size experiments, with recognition performance better than the existing state-of-the-art methods. The experimental results show that the method exhibits excellent recognition performance in multi-class cell-phones recognition.

View on arXiv PDF

Similar