Semi-supervised learning of glottal pulse positions in a neural analysis-synthesis framework
This work addresses speech processing for applications like synthesis and analysis, but it is incremental, building on previous synthetic data approaches.
The paper tackles the problem of estimating glottal closure instants (GCI) in speech by introducing a semi-supervised training strategy that refines a neural estimator using real speech without ground truth, achieving a very significant improvement in GCI analysis.
This article investigates into recently emerging approaches that use deep neural networks for the estimation of glottal closure instants (GCI). We build upon our previous approach that used synthetic speech exclusively to create perfectly annotated training data and that had been shown to compare favourably with other training approaches using electroglottograph (EGG) signals. Here we introduce a semi-supervised training strategy that allows refining the estimator by means of an analysis-synthesis setup using real speech signals, for which GCI ground truth does not exist. Evaluation of the analyser is performed by means of comparing the GCI extracted from the glottal flow signal generated by the analyser with the GCI extracted from EGG on the CMU arctic dataset, where EGG signals were recorded in addition to speech. We observe that (1.) the artificial increase of the diversity of pulse shapes that has been used in our previous construction of the synthetic database is beneficial, (2.) training the GCI network in the analysis-synthesis setup allows achieving a very significant improvement of the GCI analyser, (3.) additional regularisation strategies allow improving the final analysis network when trained in the analysis-synthesis setup.