Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection
2020 · Xiaosu Tong, Che-Wei Huang, Sri Harish Mallidi, et al.
Abstract
In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM architecture. Streaming capability is achieved by using unidirectional LSTM layers and a causal mean aggregation layer that forms the utterance-level prediction from the frames seen up to the current frame. To avoid redundant computation during online streaming inference, we use a caching mechanism for every convolution operation. Experimental results on a device-directed vs. non-device-directed task show that the proposed model yields an equal error rate reduction of 41% compared to our previous best model on this task. Furthermore, we show that the proposed model makes accurate predictions earlier in time than the attention-based models.
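The two streaming ingredients named in the abstract, causal mean aggregation and cached convolutions, can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `CausalMean` and `CachedCausalConv1d`, the scalar per-frame scores, the single-channel kernel, and the zero left-padding are all assumptions made for illustration. The abstract does not specify layer shapes or padding.

```python
from collections import deque
from typing import Deque, List, Sequence


class CausalMean:
    """Running mean of all frame-level scores seen so far (hypothetical helper)."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, frame_score: float) -> float:
        # Only past and current frames contribute, so the result is causal.
        self.total += frame_score
        self.count += 1
        return self.total / self.count


class CachedCausalConv1d:
    """Single-channel causal convolution that caches the last K-1 inputs.

    Assumption: zero left-padding, so the first outputs see zeros as history.
    """

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        history = len(self.kernel) - 1
        self.cache: Deque[float] = deque([0.0] * history, maxlen=history)

    def step(self, x: float) -> float:
        # Reuse cached inputs instead of re-reading the whole utterance.
        window = list(self.cache) + [x]
        y = sum(w * v for w, v in zip(self.kernel, window))
        self.cache.append(x)  # the oldest cached input is dropped automatically
        return y


def causal_conv_offline(xs: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Reference: the same convolution computed over the full sequence at once."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)
    return [sum(w * v for w, v in zip(kernel, padded[t:t + k])) for t in range(len(xs))]


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]
    conv = CachedCausalConv1d([0.5, 0.25, 0.25])
    mean = CausalMean()
    for x in frames:
        # Per-frame work is constant: one window product and one running-mean update.
        print(mean.update(conv.step(x)))
```

Each incoming frame costs a fixed amount of work regardless of how long the utterance has been running. The cache makes the frame-by-frame outputs match what the offline convolution would produce over the full sequence.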