Spontaneous Informal Speech Dataset for Punctuation Restoration
This work addresses the gap between how punctuation restoration models are evaluated and how they are used in real-world ASR applications. It is relevant to researchers and practitioners in speech processing, though the contribution is incremental because it focuses on dataset creation rather than a new method.
The authors tackle the problem that punctuation restoration models are typically evaluated on scripted corpora rather than spontaneous speech. They introduce SponSpeech, a dataset derived from informal speech sources that includes punctuation and casing information. They also contribute a filtering pipeline for generating more data and construct a challenging test set that evaluates models' ability to use audio information to predict ambiguous punctuation.
Presently, punctuation restoration models are evaluated almost exclusively on well-structured, scripted corpora. In contrast, real-world ASR systems and post-processing pipelines are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict punctuation that is otherwise grammatically ambiguous. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.
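To illustrate what "punctuation and casing information" can look like in a punctuation restoration dataset, the following minimal Python sketch turns a punctuated transcript into lowercase tokens paired with punctuation and casing labels. The label names and the `to_labeled_tokens` helper are hypothetical and chosen for illustration only; they are not taken from the SponSpeech release, whose actual format may differ.

```python
# Hypothetical label inventory (an assumption, not the SponSpeech label set).
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}


def to_labeled_tokens(text):
    """Convert punctuated, cased text into (token, punct_label, case_label) triples.

    The token is lowercased and stripped of trailing punctuation, mimicking raw
    ASR output; the labels record what a restoration model should predict.
    """
    triples = []
    for raw in text.split():
        # The last character decides the punctuation label for this token.
        punct = PUNCT.get(raw[-1], "NONE")
        word = raw.rstrip("".join(PUNCT))
        if not word:
            continue  # the token consisted of punctuation only
        if word.isupper() and len(word) > 1:
            case = "ALL_CAPS"
        elif word[0].isupper():
            case = "CAPITALIZED"
        else:
            case = "LOWER"
        triples.append((word.lower(), punct, case))
    return triples


# A spontaneous-style utterance with a repeated word.
example = to_labeled_tokens("Well, I I think so. Do you?")
```

Here the repeated "I" is kept as two separate tokens, which reflects the kind of disfluency that spontaneous speech introduces and that scripted corpora rarely contain.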