Similarity And Content-based Phonetic Self Attention For Speech Recognition
2022 · Kyuhong Shim, Wonyong Sung
Abstract
Transformer-based speech recognition models have achieved great success due to the self-attention (SA) mechanism, which uses every frame in the feature extraction process. In particular, SA heads in lower layers capture various phonetic characteristics through the query-key dot product, which is designed to compute the pairwise relationship between frames. In this paper, we propose a variant of SA that extracts more representative phonetic features. The proposed phonetic self-attention (phSA) combines two types of phonetic attention: one is similarity-based and the other is content-based. In short, similarity-based attention captures the correlation between frames, while content-based attention considers each frame on its own, unaffected by other frames. We identify which parts of the original dot product equation correspond to these two attention patterns and improve each part with simple modifications. Our experiments on phoneme classification and speech recognition s
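The abstract does not give the equations, so the following is only a sketch of how a query-key dot product can split into a frame-pair part and a per-frame part. It assumes a standard SA head whose query and key projections are linear layers with bias, and it omits the usual 1/sqrt(d) scaling. All names (`pairwise`, `key_only`, `query_only`) and sizes are illustrative, and reading `pairwise` as the similarity-based part and `key_only` as the content-based part is an assumption, not the authors' stated formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 8, 4  # frames, model width, head width (illustrative)

X = rng.normal(size=(T, d_model))         # one speech frame per row
W_q = rng.normal(size=(d_model, d_head))  # query projection (assumed linear + bias)
W_k = rng.normal(size=(d_model, d_head))  # key projection (assumed linear + bias)
b_q = rng.normal(size=d_head)
b_k = rng.normal(size=d_head)

# Standard dot-product logits: entry (i, j) scores query frame i against key frame j.
Q = X @ W_q + b_q
K = X @ W_k + b_k
full_logits = Q @ K.T

# Expanding (x_i W_q + b_q) . (x_j W_k + b_k) gives four terms:
pairwise = (X @ W_q) @ (X @ W_k).T  # depends on both frames i and j
key_only = (X @ W_k) @ b_q          # depends on key frame j only
query_only = (X @ W_q) @ b_k        # depends on query frame i only
constant = b_q @ b_k                # same for every (i, j)

recomposed = pairwise + key_only[None, :] + query_only[:, None] + constant

# Softmax over keys ignores anything constant along a row, so the
# query-only and constant terms do not change the attention weights.
attn_full = softmax(full_logits, axis=-1)
attn_reduced = softmax(pairwise + key_only[None, :], axis=-1)
```

Under these assumptions, only two terms shape the attention weights: one that compares frame pairs and one that scores each key frame regardless of the query. This matches the two attention patterns the abstract describes.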
Related papers
- Transformer-based End-to-end Speech Recognition With Local Dense Synthesizer Attention (2020) 12.04
- Simplified Self-attention For Transformer-based End-to-end Speech Recognition (2020) 10.61
- Self-attention Transducers For End-to-end Speech Recognition (2019) 11.93
- Transformer-based End-to-end Speech Recognition With Residual Gaussian-based Self-attention (2021) 5.84
- Gaussian Kernelized Self-attention For Long Sequence Data And Its Application To Ctc-based Speech Recognition (2021) 4.52
- Input-independent Attention Weights Are Expressive Enough: A Study Of Attention In Self-supervised Audio Transformers (2020) 0.00
- Streaming Transformer-based Acoustic Models Using Self-attention With Augmented Memory (2020) 0.00
- A Comparison Of Transformer, Convolutional, And Recurrent Neural Networks On Phoneme Recognition (2022) 0.00