Awesome Speech Audio

David Harwath — most-cited papers & profile · Speech Audio

David Harwath

37 papers · 79 citations · 0 h-index

The University of Texas at Austin

Most-cited papers

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
2022 · 19 citations

Transfer Learning From Audio-visual Grounding To Speech Recognition
2019 · 18 citations

Learning Hierarchical Discrete Linguistic Units From Visually-grounded Speech
2019 · 11 citations

Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
2018 · 9 citations

Fast-Slow Transformer for Visually Grounding Speech
2021 · 7 citations

VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
2025 · 5 citations

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval
2022 · 2 citations

Textless Speech-to-Speech Translation With Limited Parallel Data
2023 · 2 citations

How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario
2024 · 1 citation

Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
2025 · 1 citation

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
2022 · 1 citation

Continual Learning for On-Device Speech Recognition using Disentangled Conformers
2022 · 1 citation

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
2023 · 1 citation

When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
2023 · 1 citation

Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces
2026

Top co-authors

Hung-yi Lee · 6
Eunsol Choi · 5
Abdelrahman Mohamed · 4
James Glass · 4
Kwanghee Choi · 4
Shang-Wen Li · 4
David R. Mortensen · 3
Eunjung Yeo · 3
Heng-Jui Chang · 3
Wei-Cheng Tseng · 3
Kai-Wei Chang · 2
Kuan-Po Huang · 2

Topics

Speech Recognition · Audio Understanding · Text-to-Speech · Multimodal Audio · cs.CL · Speech Translation · eess.AS · Audio Generation · cs.SD · cs.LG