DeepLip: A Hybrid CNN-RNN-LSTM Framework for End-to-End Visual Speech Recognition using CTC Loss

Abstract

Lip reading is the process of comprehending speech by interpreting lip movements. It has drawn considerable attention because it can be used in audiovisual speech recognition, optimization, and separation. Traditional solutions have mainly relied on CNNs such as ResNet to extract spatial information from video frames. However, CNNs do not capture temporal correlations satisfactorily, which makes multi-modal systems more computationally expensive and increases latency. RNNs and LSTMs can be combined to model temporal changes, but they also face scaling challenges. In this paper, we propose DeepLip, a unified CNN-RNN-LSTM architecture for end-to-end visual speech recognition. DeepLip integrates spatial feature extraction with temporal embeddings. These embeddings leverage the strengths of both convolutional and recurrent layers to model both local and sequential dynamics. They are therefore well suited to alignment-based training with CTC loss, which enables word-level and sentence-level recognition with high accuracy. Our experiments on two datasets, English LRW and Mandarin LRW-1000, show that DeepLip outperforms the current state of the art while being more efficient and cheaper to run.
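As a brief illustration of why CTC suits per-frame outputs of this kind, the sketch below shows the standard CTC collapsing rule, which turns a per-frame label path into an output sequence by merging repeated labels and then removing blanks. It is a minimal sketch rather than DeepLip's decoder: the abstract does not describe the decoding procedure, and the blank symbol `_`, the function names, and the toy scores are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; the abstract does not name one


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label path into an output string (CTC rule)."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank symbol.
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Pick the highest-scoring label for each frame, then collapse the path."""
    path = []
    for scores in frame_scores:
        best_index = max(range(len(scores)), key=scores.__getitem__)
        path.append(alphabet[best_index])
    return ctc_collapse(path, blank)


if __name__ == "__main__":
    # Toy per-frame scores over the alphabet ["_", "a", "b"] (illustrative only).
    toy_scores = [
        [0.1, 0.8, 0.1],
        [0.1, 0.7, 0.2],
        [0.9, 0.05, 0.05],
        [0.1, 0.2, 0.7],
    ]
    print(greedy_decode(toy_scores, ["_", "a", "b"]))
    print(ctc_collapse("hh_e_ll_lo"))
```

The blank between the two `l` runs in the last call is what allows a doubled letter to survive the merge step, so the network does not need frame-level alignment labels during training.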
