CL, Apr 30, 2023

Building a Non-native Speech Corpus Featuring Chinese-English Bilingual Children: Compilation and Rationale

Hiuchung Hung, Andreas Maier, Thorsten Piske

arXiv:2305.00446v2, h-index: 3

Originality: Synthesis-oriented

AI Analysis

The corpus provides a resource for second language teaching and ASR improvement, but the contribution is incremental because it centers on data collection for a specific domain.

The paper compiled a non-native speech corpus of 6.5 hours of English narratives from fifty 5- to 6-year-old Chinese-English bilingual children, including transcripts, error annotations, and human-rated scores, to address challenges in transcribing low-intelligibility L2 speech.

This paper introduces a non-native speech corpus consisting of narratives from fifty 5- to 6-year-old Chinese-English bilingual children. Transcripts totaling 6.5 hours of children taking a narrative comprehension test in English (L2) are presented, along with human-rated scores and annotations of grammatical and pronunciation errors. The children also completed the parallel MAIN tests in Chinese (L1) for reference purposes. For all tests, we recorded audio and video with our self-developed remote collection methods. The video recordings help transcribers cope with the low intelligibility of L2 narratives produced by young children. This corpus offers a resource for second language teaching and has the potential to improve the performance of automatic speech recognition (ASR).
