Awesome Sound
Sound is one of the most active areas in Awesome AI Agents, with 60 papers in this collection evaluated on datasets such as Alpaca-Eval, InstructS-2S-Eval, and Llama Questions. A strong starting point is "dots.tts Technical Report".
Key papers
- dots.tts Technical Report (2026) | Shi Lian et al. | 9.02
- Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding (2026) | Zhiyuan Zhu et al. | 5.88
- SpeechDx: A Multi-Task Benchmark for Clinical Speech AI (2026) | Sejal Bhalla et al. | 5.49
- Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition (2026) | Seung Hwan Cho et al. | 5.01
- FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition (2026) | Fernando López et al. | 5.01
- NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation (2026) | Dongwook Lee et al. | 5.01
- A Neuromorphic Trigger for Efficient Audio Event Detection (2026) | Benjamin Hatton et al. | 5.01
- Adaptive Speech-to-Spike Encoding for Spiking Neural Networks (2026) | Taharim Rahman Anon et al. | 5.01
- Bielik 11B v3: Multilingual Large Language Model for European Languages (2026) | Krzysztof Ociepa et al. | 4.70
- DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities (2026) | Sajad Ebrahimi et al. | 4.39
- Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition (2026) | Yifan Liao et al. | 4.39
- F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation (2026) | Dinghao Zhou et al. | 4.39
- MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation (2026) | Xuzhi Wang et al. | 4.39
- Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents (2026) | Chibuzor Okocha et al. | 4.39
- RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark (2026) | Hongyu Jin et al. | 4.39
- Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models (2026) | Tsung-En Lin et al. | 4.39
- Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains (2026) | Zilai Wang et al. | 4.39
- Towards Personalized Federated Learning for Dysarthric Speech Recognition (2026) | Tao Zhong et al. | 4.39
- Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches (2026) | Dezhi Yu et al. | 4.39
- Multimodal Speaker Identification in Classroom Environments (2026) | Michael L. Chrzan et al. | 4.39
- Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech (2026) | Alef Iury Siqueira Ferreira et al. | 4.39
- FAConformer: Frequency-Aware Convolutional Transformer for Auditory Attention Decoding (2026) | Ziwei Wang et al. | 4.39
- Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources (2026) | Oh Hyun-Bin et al. | 4.39
- The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions (2026) | Piotr Kitłowski et al. | 4.39
- Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms (2026) | Chen Ying Claude et al. | 4.39
- From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing (2026) | Hugo Daumain et al. | 4.39
- Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models (2026) | Ravi Ranjan et al. | 4.39
- Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech (2026) | Adarsh Arigala et al. | 4.39
- From Physics to Representation: Audio Learning with Synthetic Pre-training via Procedural Generation (2026) | Fengrui Liu et al. | 4.39
- Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection (2026) | Elham Abolhasani et al. | 4.39
- Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback (2026) | Rong Wang et al. | 4.39
- ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition (2026) | Andrei-Marius Avram et al. | 4.39
- Scaling Human and G2P Supervision for Robust Phonetic Transcription (2026) | Alexander Metzger et al. | 4.39
- Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas (2026) | Mudit Sinha et al. | 4.39
- XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models (2026) | Yupei Li et al. | 4.39
- TMASC: Transmasculine Attitude and Speech Corpus (2026) | Sidney Wong | 4.39
- Learning aligned EEG representations with subject-specific encoders (2026) | Bruna J. Lopes et al. | 4.39
- From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text (2026) | Sadia Noor et al. | 4.39
- Connecting Speech to Words through Images (2026) | Gabriel Pirlogeanu et al. | 4.39
- Robust Spoofed Speech Detection via Temporal Pyramid Modeling (2026) | Mahtab Masoudi Nezhad et al. | 4.39
- Data-Driven Decoding of Russell's Circumplex Model of Affect (2026) | Amdjed Belaref et al. | 4.39
- Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control (2026) | Joon-Seung Choi et al. | 4.39
- Turning music identification into a neural forward pass (2026) | Muhammad Taimoor Haseeb et al. | 4.39
- L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification (2026) | Hyung-Seok Oh et al. | 4.39
- A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models (2026) | Apoorva Kulkarni et al. | 4.39
- Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews (2026) | Franziska Braun et al. | 4.39
- Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD) (2026) | Sinclair Gurny et al. | 4.39
- EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film (2026) | Nelly Garcia et al. | 4.39
- MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data (2026) | Subhankar Ghosh et al. | 4.39
- Fair Cognitive Impairment Detection Through Unlearning (2026) | William Nguyen et al. | 4.39
- Speech-Driven End-to-End Language Discrimination towards Chinese Dialects (2026) | Fan Xu et al. | 4.39
- Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation (2026) | Fan Xu et al. | 4.39
- QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement (2026) | Shogo Yamauchi et al. | 4.39
- NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization (2026) | Yizhuo Yang et al. | 4.39
- Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation (2026) | Ioannis Prokopiou et al. | 4.39
- Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors (2026) | Michael Finkelson et al. | 4.39
- Dealing with Annotator Disagreement in Hate Speech Classification (2025) | Somaiyeh Dehghan et al. | 3.64
- BareWave: Waveform-Native Flow-Matching Text-to-Speech (2026) | Wei Fan et al. | 3.51
- Mood-Aware Music Recommendation: Integrating User Affective Signals into Ranking Systems (2026) | Terence Zeng et al. | 3.51
- The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models (2026) | Zachary Nicholas Houghton et al. | 3.51