A Comparison Study on Infant-Parent Voice Diarization
This work addresses the problem of accurately distinguishing infant and parent voices in audio recordings for developmental research, but it is incremental as it builds on existing diarization methods with component swaps and optimizations.
The paper tackles infant-parent voice diarization with a framework built from state-of-the-art diarization algorithms. Its best system reaches a 43.8% diarization error rate (DER) on the test dataset, an improvement over the 55.4% DER obtained with LENA software.
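To make the reported metric concrete, the following is a minimal sketch of a frame-level DER computation, together with the relative error reduction implied by the two reported figures. It is a simplification under stated assumptions: every frame carries at most one label, hypothesis labels are already mapped to reference labels, and no forgiveness collar or overlap handling is applied. The paper's exact scoring protocol is not given here, and the function names `frame_der` and `relative_reduction` are hypothetical.

```python
def frame_der(reference, hypothesis):
    """Simplified frame-level diarization error rate.

    reference, hypothesis: equal-length sequences of speaker labels,
    with None marking a non-speech frame. Assumes one label per frame
    and that hypothesis labels already match the reference label set.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1              # reference speech frame
            if hyp is None:
                missed += 1          # speech labelled as non-speech
            elif hyp != ref:
                confusion += 1       # speech assigned to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # non-speech labelled as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


def relative_reduction(baseline, system):
    """Relative error reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


# Toy example with illustrative labels (not data from the paper).
ref = ["infant", "infant", None, "parent", "parent", None]
hyp = ["infant", "parent", None, "parent", None, "parent"]
toy_der = frame_der(ref, hyp)

# Figures reported in the paper: 55.4% (LENA) versus 43.8% (best system).
reduction = relative_reduction(55.4, 43.8)
```

With the reported figures, the absolute gap is 11.6 DER points, which corresponds to a relative reduction of roughly 21%.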
We design a framework for studying prelinguistic child voice from 3 to 24 months based on state-of-the-art algorithms in diarization. Our system consists of a time-invariant feature extractor, a context-dependent embedding generator, and a classifier. We study the effect of swapping out different components of the system, as well as changing the loss function, to find the best performance. We also present a multiple-instance learning technique that allows us to pre-train our parameters on larger datasets with coarser segment boundary labels. We found that our best system achieved 43.8% DER on the test dataset, compared to 55.4% DER achieved by LENA software. We also found that using a convolutional feature extractor instead of logmel features significantly increases the performance of neural diarization.
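The idea behind pre-training on coarse segment labels can be illustrated with a minimal multiple-instance learning sketch. The abstract does not specify the formulation used, so everything below is an assumption: a segment is treated as a "bag" of frames, the bag label only states whether a class (for example, infant voice) occurs somewhere in the segment, frame probabilities are pooled with a maximum, and the loss is binary cross-entropy at the bag level. The names `bag_probability` and `bag_bce_loss` are hypothetical.

```python
import math


def bag_probability(frame_probs):
    """Max-pool per-frame probabilities into one segment-level probability.

    Max pooling is an assumed choice; it encodes "the class is present
    in the segment if at least one frame contains it".
    """
    if not frame_probs:
        raise ValueError("a bag needs at least one frame")
    return max(frame_probs)


def bag_bce_loss(frame_probs, bag_label, eps=1e-7):
    """Binary cross-entropy between the pooled probability and a coarse label.

    bag_label: 1 if the class occurs anywhere in the segment, else 0.
    eps clamps the probability to avoid log(0).
    """
    p = min(max(bag_probability(frame_probs), eps), 1.0 - eps)
    return -(bag_label * math.log(p) + (1 - bag_label) * math.log(1.0 - p))


# Illustrative per-frame probabilities for one coarse segment.
frame_probs = [0.1, 0.8, 0.3]
loss_present = bag_bce_loss(frame_probs, 1)  # segment labelled as containing the class
loss_absent = bag_bce_loss(frame_probs, 0)   # segment labelled as not containing it
```

Because the loss depends only on the pooled value, exact frame boundaries are never needed during pre-training, which is what makes coarsely labelled datasets usable in this kind of setup.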