Awesome Papers

Papers

Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras (2021)
Ander Arriandiaga et al.
—
Development and Evaluation of Video Recordings for the OLSA Matrix Sentence Test (2021)
Gerard Llorach et al.
—
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation (2021)
Daniel Michelsanti et al.
—
Audio-Visual Speech Inpainting with Deep Learning (2021)
Giovanni Morrone et al.
—
An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments (2022)
Shrishti Saha Shetu et al.
—
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency (2021)
Ruohan Gao and Kristen Grauman
—
Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement (2022)
Xinmeng Xu and Jianjun Hao
—
Learning Audio-Visual Correlations from Variational Cross-Modal Generation (2021)
Ye Zhu et al.
—
Active Audio-Visual Separation of Dynamic Sound Sources (2022)
Sagnik Majumder and Kristen Grauman
—
Learning Sound Localization Better From Semantically Similar Samples (2022)
Arda Senocak et al.
—
Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition (2022)
Zi-Qiang Zhang et al.
—
VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion (2022)
Disong Wang et al.
—
Learning English with Peppa Pig (2023)
Mitja Nikolaus, Afra Alishahi, and Grzegorz Chrupała
—
Visually Supervised Speaker Detection and Localization via Microphone Array (2022)
Davide Berghi et al.
—
Deep CardioSound-An Ensembled Deep Learning Model for Heart Sound MultiLabelling (2022)
Li Guo et al.
—
The 2021 NIST Speaker Recognition Evaluation (2022)
Seyed Omid Sadjadi, Craig Greenberg, Elliot Singer, Lisa Mason, and Douglas Reynolds
—
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations (2022)
Dan Oneata et al.
—
VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution (2022)
Liangbin Xie, Xintao Wang et al.
—
Perceptual Evaluation on Audio-visual Dataset of 360 Content (2022)
Randy F Fela et al.
—
FlexLip: A Controllable Text-to-Lip System (2022)
Dan Oneata et al.
—
Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos (2022)
Alexander Waibel, Moritz Behr, Fevziye Irem Eyiokur, Dogucan Yaman, Tuan-Nam Nguyen, Carlos Mullov, Mehmet Arif Demirtas, Alperen Kantarcı, Stefan Constantin, and Hazım Kemal Ekenel
—
Show Me Your Face, And I'll Tell You How You Speak (2022)
Christen Millerdurai et al.
—
Graph-based Multi-View Fusion and Local Adaptation: Mitigating Within-Household Confusability for Speaker Identification (2023)
Long Chen et al.
—
Audio-Visual Segmentation (2023)
Jinxing Zhou et al.
—
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition (2022)
Joanna Hong et al.
—
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality (2022)
Wei-Ning Hsu et al.
—
Speaker-adaptive Lip Reading with User-dependent Padding (2022)
Minsu Kim et al.
—
StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation (2024)
Dongchan Min et al.
—
Prospectively accelerated dynamic speech MRI at 3 Tesla using a self-navigated spiral based manifold regularized scheme (2023)
Rushdi Zahid Rusho et al.
—
Unsupervised active speaker detection in media content using cross-modal information (2022)
Rahul Sharma and Shrikanth Narayanan
—
Multi-Source Transformer Architectures for Audiovisual Scene Classification (2022)
Wim Boes et al.
—
Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function (2022)
Qing Wang et al.
—
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory (2022)
Se Jin Park et al.
—
Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages (2022)
Anusha Prakash et al.
—
MarginNCE: Robust Sound Localization with a Negative Margin (2022)
Sooyoung Park et al.
—
AVATAR submission to the Ego4D AV Transcription Challenge (2022)
Paul Hongsuck Seo et al.
—
DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model (2023)
Fan Zhang et al.
—
Synthesizing audio from tongue motion during speech using tagged MRI via transformer (2023)
Xiaofeng Liu et al.
—
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition (2024)
Minsu Kim et al.
—
Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification (2023)
Meng Liu et al.
—
UniFLG: Unified Facial Landmark Generator from Text or Speech (2023)
Kentaro Mitsui et al.
—
Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model (2023)
Jaeyoung Huh et al.
—
SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks (2023)
Naoki Kimura et al.
—
WASD: A Wilder Active Speaker Detection Dataset (2023)
Tiago Roxo et al.
—
ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers (2023)
Akash Gupta et al.
—
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert (2023)
Jiadong Wang et al.
—
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment (2023)
Kim Sung-Bin et al.
—
Deep sound-field denoiser: optically-measured sound-field denoising using deep neural network (2023)
Kenji Ishikawa et al.
—
Towards Ultrasound Tongue Image prediction from EEG during speech production (2023)
Tamás Gábor Csapó et al.
—
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model (2024)
Jeong Hun Yeo et al.
—