Awesome Sound

dots.tts Technical Report (2026)

Shi Lian et al.

9.02

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding (2026)

Zhiyuan Zhu et al.

5.88

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI (2026)

Sejal Bhalla et al.

5.49

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition (2026)

Seung Hwan Cho et al.

5.01

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition (2026)

Fernando López et al.

5.01

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation (2026)

Dongwook Lee et al.

5.01

A Neuromorphic Trigger for Efficient Audio Event Detection (2026)

Benjamin Hatton et al.

5.01

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks (2026)

Taharim Rahman Anon et al.

5.01

Bielik 11B v3: Multilingual Large Language Model for European Languages (2026)

Krzysztof Ociepa et al.

4.70

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities (2026)

Sajad Ebrahimi et al.

4.39

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition (2026)

Yifan Liao et al.

4.39

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation (2026)

Dinghao Zhou et al.

4.39

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation (2026)

Xuzhi Wang et al.

4.39

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents (2026)

Chibuzor Okocha et al.

4.39

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark (2026)

Hongyu Jin et al.

4.39

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models (2026)

Tsung-En Lin et al.

4.39

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains (2026)

Zilai Wang et al.

4.39

Towards Personalized Federated Learning for Dysarthric Speech Recognition (2026)

Tao Zhong et al.

4.39

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches (2026)

Dezhi Yu et al.

4.39

Multimodal Speaker Identification in Classroom Environments (2026)

Michael L. Chrzan et al.

4.39

Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech (2026)

Alef Iury Siqueira Ferreira et al.

4.39

FAConformer: Frequency-Aware Convolutional Transformer for Auditory Attention Decoding (2026)

Ziwei Wang et al.

4.39

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources (2026)

Oh Hyun-Bin et al.

4.39

The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions (2026)

Piotr Kitłowski et al.

4.39

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms (2026)

Chen Ying Claude et al.

4.39

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing (2026)

Hugo Daumain et al.

4.39

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models (2026)

Ravi Ranjan et al.

4.39

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech (2026)

Adarsh Arigala et al.

4.39

From Physics to Representation: Audio Learning with Synthetic Pre-training via Procedural Generation (2026)

Fengrui Liu et al.

4.39

Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection (2026)

Elham Abolhasani et al.

4.39

Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback (2026)

Rong Wang et al.

4.39

ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition (2026)

Andrei-Marius Avram et al.

4.39

Scaling Human and G2P Supervision for Robust Phonetic Transcription (2026)

Alexander Metzger et al.

4.39

Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas (2026)

Mudit Sinha et al.

4.39

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models (2026)

Yupei Li et al.

4.39

TMASC: Transmasculine Attitude and Speech Corpus (2026)

Sidney Wong

4.39

Learning aligned EEG representations with subject-specific encoders (2026)

Bruna J. Lopes et al.

4.39

From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text (2026)

Sadia Noor et al.

4.39

Connecting Speech to Words through Images (2026)

Gabriel Pirlogeanu et al.

4.39

Robust Spoofed Speech Detection via Temporal Pyramid Modeling (2026)

Mahtab Masoudi Nezhad et al.

4.39

Data-Driven Decoding of Russell's Circumplex Model of Affect (2026)

Amdjed Belaref et al.

4.39

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control (2026)

Joon-Seung Choi et al.

4.39

Turning music identification into a neural forward pass (2026)

Muhammad Taimoor Haseeb et al.

4.39

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification (2026)

Hyung-Seok Oh et al.

4.39

A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models (2026)

Apoorva Kulkarni et al.

4.39

Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews (2026)

Franziska Braun et al.

4.39

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD) (2026)

Sinclair Gurny et al.

4.39

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film (2026)

Nelly Garcia et al.

4.39

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data (2026)

Subhankar Ghosh et al.

4.39

Fair Cognitive Impairment Detection Through Unlearning (2026)

William Nguyen et al.

4.39

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects (2026)

Fan Xu et al.

4.39

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation (2026)

Fan Xu et al.

4.39

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement (2026)

Shogo Yamauchi et al.

4.39

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization (2026)

Yizhuo Yang et al.

4.39

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation (2026)

Ioannis Prokopiou et al.

4.39

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors (2026)

Michael Finkelson et al.

4.39

Dealing with Annotator Disagreement in Hate Speech Classification (2025)

Somaiyeh Dehghan et al.

3.64

BareWave: Waveform-Native Flow-Matching Text-to-Speech (2026)

Wei Fan et al.

3.51

Mood-Aware Music Recommendation: Integrating User Affective Signals into Ranking Systems (2026)

Terence Zeng et al.

3.51

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models (2026)

Zachary Nicholas Houghton et al.

3.51
