Prompt Tuning Of Deep Neural Networks For Speaker-adaptive Visual Speech Recognition
2023 · Minsu Kim, Hyung-Il Kim, Yong Man Ro
Abstract
Visual Speech Recognition (VSR) aims to transcribe speech into text from lip movements alone. Because it relies on visual information to model speech, its performance is inherently sensitive to individual lip appearance and movement, so VSR models degrade when they are applied to unseen speakers. In this paper, to remedy this degradation on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we fine-tune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Unlike previous prompt tuning methods, which are mainly limited to Transformer-variant architectures, we explore different types of prompts (the addition, the padding, and the concatenation forms) that can be applied to VSR models, which are generally composed of a CNN and a Transformer.
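The three prompt forms named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the nested-list data layout, and where each prompt is applied are hypothetical and are not taken from the paper's implementation. In an actual setup the prompt values would be the only trainable parameters, with the pre-trained model kept frozen.

```python
# Illustrative sketch only: names, shapes, and placement are assumptions,
# not the authors' code. Plain nested lists stand in for tensors.

def add_prompt(frame, prompt):
    """Addition form: add an input-sized learnable prompt element-wise."""
    return [[x + p for x, p in zip(f_row, p_row)]
            for f_row, p_row in zip(frame, prompt)]


def pad_prompt(feature, canvas, pad=1):
    """Padding form: use learnable values where a CNN layer would zero-pad.

    `canvas` has shape (H + 2*pad, W + 2*pad); its border holds the prompt
    and its interior is overwritten by the feature map.
    """
    out = [row[:] for row in canvas]  # copy so the prompt is not mutated
    for i, row in enumerate(feature):
        out[i + pad][pad:pad + len(row)] = row
    return out


def concat_prompt(tokens, prompt_tokens):
    """Concatenation form: prepend learnable tokens to a Transformer sequence."""
    return prompt_tokens + tokens


frame = [[1.0, 2.0], [3.0, 4.0]]
shifted = add_prompt(frame, [[0.5, 0.5], [0.5, 0.5]])
padded = pad_prompt(frame, [[9.0] * 4 for _ in range(4)])
sequence = concat_prompt([[1.0, 1.0]], [[0.0, 0.0], [0.5, 0.5]])
```

In this sketch the addition and padding forms act on CNN-side inputs, while the concatenation form acts on the token sequence, which is one way to read the abstract's point that the prompts must suit a model combining a CNN and a Transformer.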
Related papers
- Intapt: Information-theoretic Adversarial Prompt Tuning For Enhanced Non-native Speech Recognition (2023)
- Unipet-spk: A Unified Framework For Parameter-efficient Tuning Of Pre-trained Speech Models For Robust Speaker Verification (2025)
- Speaker-independent Speech-driven Visual Speech Synthesis Using Domain-adapted Acoustic Models (2019)
- Speaker-adaptive Neural Vocoders For Parametric Speech Synthesis Systems (2018)
- Efficient Adapter Tuning Of Pre-trained Speech Models For Automatic Speaker Verification (2024)
- Speaker Adaptation Using Spectro-temporal Deep Features For Dysarthric And Elderly Speech Recognition (2022)
- Diffv2s: Diffusion-based Video-to-speech Synthesis With Vision-guided Speaker Embedding (2023)
- Synthvsr: Scaling Up Visual Speech Recognition With Synthetic Supervision (2023)