Data mining Mandarin tone contour shapes
This work addresses variability in tone production, a topic relevant to linguists and speech processing researchers. Its contribution is incremental, since it builds on existing methods for analyzing tone contours.
The study tackles the variability of Mandarin tone contour shapes in spontaneous speech by applying data mining and NLP techniques to a large corpus of newscast speech, and it reports correlations between contour shape types and automatically extracted linguistic features.
In spontaneous speech, Mandarin tones that belong to the same tone category may exhibit many different contour shapes. We explore the use of data mining and NLP techniques for understanding the variability of tones in a large corpus of Mandarin newscast speech. First, we adapt a graph-based approach to characterize the clusters (fuzzy types) of tone contour shapes observed in each tone n-gram category. Second, we show correlations between these realized contour shape types and a bag of automatically extracted linguistic features. We discuss the implications of the current study in the context of phonological theory and information theory.
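To make the idea of graph-based grouping of contour shapes concrete, here is a minimal sketch. It is not the method used in the study, whose algorithm is not specified here. It assumes each contour is a fixed-length list of f0 samples, z-normalizes each one so that only shape remains, links any two contours whose Euclidean distance falls below a threshold, and treats connected components of the resulting graph as shape types. The names `z_normalize`, `cluster_contours`, and `threshold`, as well as the toy f0 values, are hypothetical. The sketch yields hard clusters, whereas the abstract describes fuzzy types.

```python
from itertools import combinations
from math import sqrt


def z_normalize(contour):
    """Remove pitch level and range so that only the contour shape remains."""
    mean = sum(contour) / len(contour)
    std = sqrt(sum((x - mean) ** 2 for x in contour) / len(contour))
    if std == 0:
        return [0.0] * len(contour)  # a perfectly flat contour
    return [(x - mean) / std for x in contour]


def distance(a, b):
    """Euclidean distance between two equal-length contours."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_contours(contours, threshold=1.0):
    """Link contours closer than `threshold`; return connected components.

    Assumption: a simple threshold graph stands in for whatever graph
    construction the original study adapts.
    """
    shapes = [z_normalize(c) for c in contours]
    neighbors = {i: set() for i in range(len(shapes))}
    for i, j in combinations(range(len(shapes)), 2):
        if distance(shapes[i], shapes[j]) < threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    clusters, seen = [], set()
    for start in range(len(shapes)):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Toy f0 contours in Hz (invented for illustration, not corpus data).
contours = [
    [100, 120, 140, 160, 180],  # rising
    [200, 215, 230, 245, 260],  # rising, higher register
    [180, 160, 140, 120, 100],  # falling
    [250, 225, 200, 175, 150],  # falling, higher register
    [150, 150, 150, 150, 150],  # flat
]
clusters = cluster_contours(contours)
print(clusters)
```

On this toy input the two rising contours, the two falling contours, and the flat contour end up in three separate groups, regardless of register. A cluster label obtained this way could then be paired with linguistic features of each token to look for the kind of correlations the abstract describes.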