CL | AI | LG | AS | Jan 9, 2025

Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs

arXiv:2501.05234v1 | 11 citations | h-index: 2 | NoDaLiDa/Baltic-HLT
Originality: Synthesis-oriented
AI Analysis

This is an incremental improvement for Estonian TV content accessibility, focusing on domain-specific subtitle generation.

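The paper addresses high-quality Estonian TV subtitle generation by fine-tuning Whisper on human-generated subtitles and adding iterative pseudo-labeling and LLM-based post-editing. Pseudo-labeling with an unlabeled dataset notably improved subtitle quality, and LLM-based editing improved accuracy when applied at test time but yielded no further gains when used during training.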

This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitle quality close to human standard and could be extended to real-time applications.
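The pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustration of generic iterative pseudo-labeling with an inference-time editing step, not the authors' implementation: the names `iterative_pseudo_labeling`, `subtitle_with_post_edit`, `train`, and `post_edit` are hypothetical, the number of rounds is an assumption, and any filtering of pseudo-labels the paper may apply is not shown. In practice `train` would stand for fine-tuning Whisper and `post_edit` for a call to an LLM.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]      # (audio identifier, subtitle text)
Model = Callable[[str], str]   # maps an audio identifier to subtitle text


def iterative_pseudo_labeling(
    labeled: List[Example],
    unlabeled: List[str],
    train: Callable[[List[Example]], Model],
    rounds: int = 2,           # assumed value, not taken from the paper
) -> Model:
    """Train on labeled data, then repeatedly relabel the unlabeled pool and retrain."""
    model = train(labeled)
    for _ in range(rounds):
        # Label the unlabeled audio with the current model.
        pseudo = [(audio, model(audio)) for audio in unlabeled]
        # Retrain on the human subtitles plus the fresh pseudo-labels.
        model = train(labeled + pseudo)
    return model


def subtitle_with_post_edit(
    model: Model,
    post_edit: Callable[[str], str],
    audio: str,
) -> str:
    """Apply the (hypothetical) LLM editor only at test time, on the model output."""
    return post_edit(model(audio))
```

Keeping `post_edit` outside the training loop mirrors the reported finding that LLM-based editing helps at test time but gives no further gains when used during training.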
