Improving The Gap In Visual Speech Recognition Between Normal And Silent Speech Based On Metric Learning
2023 · Sara Kashiwagi, Keitaro Tanaka, Qi Feng, et al.
Abstract
This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes. Specifically, we aim to map inputs from the two speech types close to each other in a latent space if they have similar viseme representations. By minimizing the Kullback-Leibler divergence of the predicted viseme probability distributions between and within the two speech types, our model effectively learns and predicts viseme identities. Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.
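The abstract does not give the exact loss, so the following is only a minimal sketch of the general idea: it compares per-frame viseme probability distributions from a normal-speech clip and a silent-speech clip using a symmetric Kullback-Leibler divergence. The function names (`softmax`, `kl_divergence`, `viseme_metric_loss`), the symmetric form, and the assumption that both clips are frame-aligned with logits of shape `(frames, visemes)` are illustrative choices, not details taken from the paper.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Convert raw scores to probabilities along the viseme axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) per frame; eps clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def viseme_metric_loss(normal_logits, silent_logits):
    """Mean symmetric KL between two sets of viseme distributions.

    Assumption (hypothetical setup): both inputs have shape
    (frames, visemes) and frame i of one clip corresponds to
    frame i of the other.
    """
    p = softmax(normal_logits)
    q = softmax(silent_logits)
    per_frame = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return float(np.mean(per_frame))


# Toy example: 2 frames, 3 viseme classes.
normal = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
silent = np.array([[-1.0, 0.5, 2.0], [0.1, 0.2, 3.0]])
loss = viseme_metric_loss(normal, silent)
```

The loss is zero when the two clips predict identical viseme distributions and grows as the predictions diverge, which is the behaviour a metric-learning objective of this kind relies on. The same function could be applied to two clips of the same speech type to cover the "within" case mentioned in the abstract.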
Related papers
- Multi-task Metric Learning For Text-independent Speaker Verification (2020) 0.00
- Metric Learning With Progressive Self-distillation For Audio-visual Embedding Learning (2025) 3.58
- Lip2vec: Efficient And Robust Visual Speech Recognition Via Latent-to-latent Visual To Audio Representation Mapping (2023) 6.77
- A Comparison Of Metric Learning Loss Functions For End-to-end Speaker Verification (2020) 6.77
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) 6.34
- Metricnet: Towards Improved Modeling For Non-intrusive Speech Quality Assessment (2021) 0.00
- Visual Gesture Variability Between Talkers In Continuous Visual Speech (2017) 0.00
- Syncvsr: Data-efficient Visual Speech Recognition With End-to-end Crossmodal Audio Token Synchronization (2024) 8.35