ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Tiantian Feng·Anfeng Xu·Xuan Shi·Aditya Kommineni·Shakhrul Iman Siam·Megan Micheletti·Zhonghao Shi·Helen Tager-Flusberg·Mi Zhang·Lynn K. Perry·Catherine Lord·Daniel Messinger·Shrikanth Narayanan·2026

arXiv:2605.29257v1

cs.SD

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. It integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performing models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
