Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones
This work addresses accent preservation in voice cloning for speech technology applications. It highlights a perceptual gap that, while incremental, is important for evaluating speaker identity and accent separately.
The study investigated how accent variation affects voice cloning by comparing standard and accented Mandarin speech with their voice clones. Listeners rated clones as more similar to their originals for standard speakers than for accented speakers, and intelligibility improved from original to clone more for accented speech. Computational embedding distances between originals and clones showed no reliable difference between the two speaker groups.
Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
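The embedding-based comparison described above can be illustrated with a minimal sketch. It assumes that each original and each clone has already been mapped to a fixed-length speaker-embedding vector by some off-the-shelf model, that original-clone distance is measured as cosine distance, and that the two speaker groups are compared with a permutation test on the mean distance. The abstract does not specify the distance metric or the statistical test, so both are assumptions, and the function names `cosine_distance` and `permutation_test` are hypothetical.

```python
import math
import random
from statistics import mean


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in mean distance.

    group_a, group_b: original-clone distances for two speaker groups
    (for example, accented and standard). Returns the observed mean
    difference and a permutation p-value.
    Assumption: this test is illustrative, not the one used in the study.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)
```

In use, one would compute `cosine_distance(original_embedding, clone_embedding)` for every speaker, collect the values into one list per group, and pass the two lists to `permutation_test`. A large p-value would correspond to the "no reliable accented-standard difference" pattern reported for the embedding analysis, although the study's own procedure may differ from this sketch.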