LRS-3
Emerging; 34 papers using it
First seen: 2022
LRS-3 is a dataset used to evaluate audio-visual speech recognition (AVSR) systems, containing diverse real-world spoken conversations with human-annotated transcriptions.
Papers using LRS-3 (34)
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer
- Seeing Through The Conversation: Audio-visual Speech Separation Based On Diffusion Model
- U-hubert: Unified Mixed-modal Speech Pretraining And Zero-shot Transfer To Unlabeled Modality
- Litevsr: Efficient Visual Speech Recognition By Learning From Speech Representations Of Unlabeled Data
- DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
- Lipvoicer: Generating Speech From Silent Videos Guided By Lip Reading
- Large Language Models are Strong Audio-Visual Speech Recognition Learners
- VisG AV-HuBERT: Viseme-Guided AV-HuBERT
- LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
- Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition
- Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder
- Training Strategies for Modality Dropout Resilient Multi-Modal Target Speaker Extraction
- Online Audio-Visual Autoregressive Speaker Extraction
- LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models
- MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
- Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
- Av-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations
- Audio-visual Speech Separation In Noisy Environments With A Lightweight Iterative Model
- u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
- Jointly Learning Visual and Auditory Speech Representations from Raw Data
- ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement
- Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
- Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
- LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
- Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
- Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
- Lip-to-Speech Synthesis in the Wild with Multi-task Learning
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
- AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
- LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
- Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation