AS LG

Apr 3, 2020

Neural i-vectors

Ville Vestman, Kong Aik Lee, Tomi H. Kinnunen

arXiv:2004.01559v23.34 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses speaker verification, offering an incremental improvement by combining generative and discriminative approaches.

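The paper tackles speaker verification by using a deep embedding network to supply sufficient statistics to an i-vector extractor, producing what the authors call neural i-vectors. Evaluated on the SITW and SRE 2018 and 2019 datasets, the neural i-vectors perform about 50% worse than the deep embeddings but outperform previously reported i-vector approaches by a clear margin.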

Deep speaker embeddings have been demonstrated to outperform their generative counterparts, i-vectors, in recent speaker verification evaluations. To combine the benefits of high performance and generative interpretation, we investigate the use of a deep embedding extractor and an i-vector extractor in succession. To bundle the deep embedding extractor with an i-vector extractor, we add aggregation layers inspired by the Gaussian mixture model (GMM) to the embedding extractor networks. The inclusion of a GMM-like layer allows the discriminatively trained network to be used as a provider of sufficient statistics for the i-vector extractor to extract what we call neural i-vectors. We compare the deep embeddings to the proposed neural i-vectors on the Speakers in the Wild (SITW) and the Speaker Recognition Evaluation (SRE) 2018 and 2019 datasets. On the core-core condition of SITW, our deep embeddings obtain performance comparable to the state of the art. The neural i-vectors obtain about 50% worse performance than the deep embeddings, but on the other hand outperform the previous i-vector approaches reported in the literature by a clear margin.
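The idea of a GMM-like layer acting as a provider of sufficient statistics can be illustrated with a minimal sketch. It computes the standard zeroth-order and first-order statistics that an i-vector extractor typically consumes, using soft assignments of frames to components. The function names, the shared isotropic `precision`, the fixed `means`, and the toy data are all assumptions made for illustration. In a network, the frames would be frame-level features and the component parameters would be learned; the exact layer used in the paper is not described in the abstract and may differ.

```python
import math

def soft_assignments(frame, means, precision=1.0):
    """Posterior of each component for one frame.

    Assumption: equal component weights and a shared isotropic precision,
    so the posterior is a softmax over negative scaled squared distances.
    """
    scores = [
        -0.5 * precision * sum((x - m) ** 2 for x, m in zip(frame, mean))
        for mean in means
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sufficient_statistics(frames, means, precision=1.0):
    """Accumulate zeroth-order (N_c) and first-order (F_c) statistics."""
    num_components = len(means)
    dim = len(means[0])
    zeroth = [0.0] * num_components
    first = [[0.0] * dim for _ in range(num_components)]
    for frame in frames:
        posteriors = soft_assignments(frame, means, precision)
        for c, gamma in enumerate(posteriors):
            zeroth[c] += gamma                # N_c = sum_t gamma_t(c)
            for d in range(dim):
                first[c][d] += gamma * frame[d]  # F_c = sum_t gamma_t(c) x_t
    return zeroth, first

# Toy utterance: four 2-dimensional frames and two hypothetical components.
frames = [[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.9, 3.0]]
means = [[0.0, 0.0], [3.0, 3.0]]
zeroth, first = sufficient_statistics(frames, means)
```

Because the posteriors of each frame sum to one, the zeroth-order statistics sum to the number of frames, and the first-order statistics summed over components equal the sum of the frames. An i-vector extractor would take `zeroth` and `first` as its per-utterance input.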

