AS, AI, HC, LG, SD · Aug 14, 2025

Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech

arXiv:2508.10332v1 · h-index: 22 · WOCCI
Originality: Synthesis-oriented
AI Analysis

The work offers insights for developing child-aware speech interfaces by showing how speaker traits are distributed across model layers. Its contribution is incremental, since it applies existing methods to a specific domain.

The paper analyzes how self-supervised speech models encode age and gender traits in children's speech. It finds that early layers (1-7) capture speaker-specific cues better than deeper layers, with Wav2Vec2-large-lv60 reaching up to 97.14% accuracy for age and 98.20% for gender classification on CMU Kids.

Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.
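The layer-wise probing with PCA described above can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes per-layer utterance embeddings have already been extracted (for example, by mean-pooling each layer's hidden states over time) into an array of shape (layers, utterances, dim). The nearest-centroid classifier, the 70/30 split, the number of components, and the function names are all illustrative assumptions. The data below is synthetic, with the same class signal injected into every layer, so the scores say nothing about real Wav2Vec2 layers.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on X (n_samples, dim) via SVD; return (mean, components)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project X onto the retained principal directions."""
    return (X - mean) @ components.T

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Stand-in probe (an assumption): assign each test point to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_test).mean())

def layerwise_probe(layer_feats, labels, n_components=8, train_frac=0.7, seed=0):
    """Return one held-out accuracy per layer, using PCA-reduced features."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(labels.shape[0])
    n_train = int(train_frac * labels.shape[0])
    train_idx, test_idx = order[:n_train], order[n_train:]
    scores = []
    for feats in layer_feats:  # feats: (utterances, dim) for one layer
        # PCA is fitted on the training split only, to avoid leakage.
        mean, comps = pca_fit(feats[train_idx], n_components)
        X_tr = pca_transform(feats[train_idx], mean, comps)
        X_te = pca_transform(feats[test_idx], mean, comps)
        scores.append(
            nearest_centroid_accuracy(X_tr, labels[train_idx], X_te, labels[test_idx])
        )
    return scores

# Synthetic stand-in for real embeddings: 4 "layers", 200 utterances, 32 dims.
rng = np.random.default_rng(0)
n_layers, n_utts, dim = 4, 200, 32
labels = rng.integers(0, 2, size=n_utts)        # hypothetical binary labels
layer_feats = rng.normal(size=(n_layers, n_utts, dim))
layer_feats[:, :, 0] += 3.0 * labels            # same toy class signal in every layer

scores = layerwise_probe(layer_feats, labels)
for layer, acc in enumerate(scores, start=1):
    print(f"layer {layer}: accuracy {acc:.2f}")
```

With real models, the per-layer features would come from the encoder's intermediate outputs (in Hugging Face Transformers, requesting `output_hidden_states=True` is one way to obtain them), and the probe could be swapped for whichever classifier the study actually used.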
