End-to-end Multimodal Speech Recognition
2018 · Shruti Palaskar, Ramon Sanabria, Florian Metze
Abstract
Transcription or subtitling of open-domain videos remains a challenging task for Automatic Speech Recognition (ASR) because of the data's difficult acoustics, variable signal processing, and essentially unrestricted domain. In previous work, we showed that the visual channel (specifically object and scene features) can help adapt the acoustic model (AM) and language model (LM) of a recognizer; here we extend that work to end-to-end approaches. For a Connectionist Temporal Classification (CTC)-based approach, we retain the separation of AM and LM, whereas for a sequence-to-sequence (S2S) approach, both information sources are adapted together in a single model. This paper also analyzes the behavior of CTC and S2S models on noisy video data (the How-To corpus) and compares it with results on the clean Wall Street Journal (WSJ) corpus, providing insight into the robustness of both approaches.
Related papers
- Multimodal Grounding For Sequence-to-sequence Speech Recognition (2018)
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)
- Multi-stream End-to-end Speech Recognition (2019)
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018)
- An Investigation Of End-to-end Multichannel Speech Recognition For Reverberant And Mismatch Conditions (2019)
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016)
- End-to-end Multichannel Speaker-attributed ASR: Speaker Guided Decoder And Input Feature Analysis (2023)
- Multiple-hypothesis CTC-based Semi-supervised Adaptation Of End-to-end Speech Recognition (2021)