SD · May 28

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

arXiv:2605.29257 · 73.0 · h-index: 19
Predicted impact: top 23% in SD · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in child development and speech technology, ChildVox provides the first comprehensive benchmark to systematically evaluate models on diverse child audio signals across developmental stages.

ChildVox is a benchmark covering 17 child-centered audio datasets across 20+ sub-tasks from birth to school age. Evaluations show that existing audio and speech foundation models achieve high performance on recognizing children's acoustic signals, enabling applications like language level characterization and speech production tracking.

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

