Visual Keyword Spotting With Attention
2021 · K R Prajwal, Liliane Momeni, Triantafyllos Afouras, et al.
Abstract
In this paper, we consider the task of spotting spoken keywords in silent video sequences, also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
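As a rough illustration of the two-stream idea in the abstract, the NumPy sketch below concatenates a visual feature sequence with a phoneme embedding sequence and applies one self-attention layer over the joint sequence, so every video frame can attend to every phoneme and vice versa. It is not the Transpotter itself. The single head, the random untrained weights, the per-frame sigmoid head, the max-pooled presence score, and all names (`joint_attention`, `spot_keyword`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(video, phonemes, rng):
    """One single-head self-attention layer over both streams at once.

    video:    (T, d) per-frame visual features (hypothetical encoder output)
    phonemes: (P, d) per-phoneme keyword embeddings (hypothetical)
    """
    d = video.shape[1]
    tokens = np.concatenate([video, phonemes], axis=0)  # (T + P, d)
    # Random projections stand in for learned weights (assumption).
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(d))  # (T + P, T + P), rows sum to 1
    return weights @ v, weights


def spot_keyword(video, phonemes, seed=0):
    """Return a clip-level presence score and per-frame keyword probabilities."""
    rng = np.random.default_rng(seed)
    num_frames, d = video.shape
    fused, weights = joint_attention(video, phonemes, rng)
    # Illustrative localization head: one sigmoid score per video frame.
    w_frame = rng.standard_normal(d) / np.sqrt(d)
    frame_probs = 1.0 / (1.0 + np.exp(-(fused[:num_frames] @ w_frame)))
    # Illustrative presence score: the most confident frame (assumption).
    presence = float(frame_probs.max())
    return presence, frame_probs, weights


# Usage with made-up inputs: 25 video frames, a 4-phoneme keyword, 16-dim features.
rng = np.random.default_rng(1)
video = rng.standard_normal((25, 16))
phonemes = rng.standard_normal((4, 16))
presence, frame_probs, weights = spot_keyword(video, phonemes)
```

With untrained weights the scores are meaningless; the sketch only shows the data flow, in which the attention matrix spans both modalities and the output contains one localization score per video frame.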
Related papers
- Keyword Transformer: A Self-attention Model For Keyword Spotting (2021)
- Exploring Sequence-to-sequence Transformer-transducer Models For Keyword Spotting (2022)
- Audiomer: A Convolutional Transformer For Keyword Spotting (2021)
- Small-footprint Open-vocabulary Keyword Spotting With Quantized LSTM Networks (2020)
- Efficient Keyword Spotting By Capturing Long-range Interactions With Temporal Lambda Networks (2021)
- Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022)
- Speech Recognition: Keyword Spotting Through Image Recognition (2018)
- Separable Temporal Convolution Plus Temporally Pooled Attention For Lightweight High-performance Keyword Spotting (2021)