CL, LG · Aug 4, 2023

Speaker Diarization of Scripted Audiovisual Content

Yogesh Virkar, Brian Thompson, Rohit Paturi, Sundararajan Srinivasan, Marcello Federico

Amazon, Apple, MIT

arXiv:2308.02160v1 · 0.92 citations · h-index: 51

Originality: Incremental advance

AI Analysis

The work addresses a specific challenge in the media localization industry: creating accurate verbatim scripts. It is incremental in that it builds on existing diarization methods.

The paper tackles speaker diarization in scripted audiovisual content, where current models struggle to track many speakers and to detect frequent speaker changes. It proposes a semi-supervised approach that uses production scripts for pseudo-labeling and reports a 51.7% relative improvement over two unsupervised baseline models on a test set of 66 shows.

The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. as-broadcast script) must be structured into a sequence of dialogue lines each including time codes, speaker name and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach to leverage production scripts used during the shooting process, to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements of 51.7% relative to two unsupervised baseline models on our metrics on a 66 show test set.
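To make the structure of a verbatim (as-broadcast) script concrete, here is a minimal Python sketch of a sequence of dialogue lines, each carrying time codes, a speaker name and a transcript. The names `DialogueLine`, `format_timecode` and `render`, the timecode layout, and the sample lines are all hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    """One line of an as-broadcast script (hypothetical layout)."""
    start: float   # start time code, in seconds
    end: float     # end time code, in seconds
    speaker: str   # speaker name
    text: str      # transcript of the line


def format_timecode(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm (an assumed display format)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render(script: list[DialogueLine]) -> str:
    """Render a script as one text line per dialogue line."""
    return "\n".join(
        f"[{format_timecode(line.start)} - {format_timecode(line.end)}] "
        f"{line.speaker}: {line.text}"
        for line in script
    )


# Invented sample data, for illustration only.
script = [
    DialogueLine(1.5, 3.25, "ALICE", "Where were you last night?"),
    DialogueLine(3.4, 4.0, "BOB", "Working late."),
]

print(render(script))
```

In this layout, diarization supplies the `speaker` field and the line boundaries, while speech recognition supplies `text`.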
