CL LG | Jun 20, 2023

HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur

arXiv:2306.11252v11.33 citations | h-index: 63 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses speech translation for languages like Cantonese, where the spoken and written forms differ significantly. It is an incremental advance because it builds on existing corpus and baseline methods.

The authors tackle speech translation for languages with non-verbatim transcripts by creating HK-LegiCoST, a corpus of 600+ hours of Cantonese audio with aligned standard traditional Chinese transcripts and English translations. They demonstrate competitive baselines on the corpus and cross-corpus results on the FLEURS Cantonese subset.

We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoken and written forms of the source language. Due to its large size, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results deliver insights into speech recognition and translation research in languages for which non-verbatim or "noisy" transcription is common due to various factors, including vernacular and dialectal speech.
