Towards Lipreading Sentences With Active Appearance Models
2018 · George Sterpu, Naomi Harte
Abstract
Automatic lipreading has major potential impact for speech recognition, supplementing and complementing the acoustic modality. Most attempts at lipreading have been performed on small vocabulary tasks, due to a shortfall of appropriate audio-visual datasets. In this work we use the publicly available TCD-TIMIT database, designed for large vocabulary continuous audio-visual speech recognition. We compare the viseme recognition performance of the most widely used features for lipreading, Discrete Cosine Transform (DCT) and Active Appearance Models (AAM), in a traditional Hidden Markov Model (HMM) framework. We also exploit recent advances in AAM fitting. We found the DCT to outperform AAM by more than 6% for a viseme recognition task with 56 speakers. The overall accuracy of the DCT is quite low (32-34%). We conclude that a fundamental rethink of the modelling of visual features may be needed for this task.
Related papers
- Lipreading With 3D-2D-CNN BLSTM-HMM And Word-CTC Models (2019)
- Can DNNs Learn To Lipread Full Sentences? (2018)
- Multi-grained Spatio-temporal Modeling For Lip-reading (2019)
- LipFormer: Learning To Lipread Unseen Speakers Based On Visual-landmark Transformers (2023)
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018)
- Lipper: Synthesizing Thy Speech Using Multi-view Lipreading (2019)
- Spatio-temporal Attention Mechanism And Knowledge Distillation For Lip Reading (2021)
- SimulLR: Simultaneous Lip Reading Transducer With Attention-guided Adaptive Memory (2021)