Exploring Wav2vec 2.0 Fine-tuning For Improved Speech Emotion Recognition
2021 · Li-Wei Chen, Alexander Rudnicky
Abstract
While Wav2vec 2.0 was proposed for automatic speech recognition (ASR), it can also be used for speech emotion recognition (SER), and its performance can be significantly improved using different fine-tuning strategies. Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented. We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy, further improves the performance on SER. We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations. Experiments show that P-TAPT performs better than TAPT, especially under low-resource settings. Compared to prior work in the literature, our top-line system achieves a 7.4% absolute improvement in unweighted accuracy (UA) over the state-of-the-art performance on IEMOCAP. Our code is publicly available.
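The headline result is reported in unweighted accuracy (UA), which in the SER literature is usually taken to mean the average of per-class recalls, so that rare emotion classes count as much as frequent ones. The sketch below illustrates that common definition next to plain (weighted) accuracy. It is not the authors' evaluation code; the function names and the toy labels are assumptions made for illustration.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA; hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    total = defaultdict(int)    # number of utterances per true class
    correct = defaultdict(int)  # number of correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Every class contributes equally, regardless of how many samples it has.
    return sum(correct[c] / total[c] for c in total) / len(total)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy over all utterances (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: 6 neutral utterances and 2 angry ones.
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"]

print(unweighted_accuracy(y_true, y_pred))  # 0.75  (recalls: neutral 1.0, angry 0.5)
print(weighted_accuracy(y_true, y_pred))    # 0.875 (7 of 8 utterances correct)
```

On imbalanced data the two metrics diverge, as the toy example shows, which is why a gain reported in UA reflects improvement averaged across emotion classes rather than on the majority class alone.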
Related papers
- Active Learning Based Fine-tuning Framework For Speech Emotion Recognition (2023)
- Supervised Contrastive Learning With Nearest Neighbor Search For Speech Emotion Recognition (2023)
- Active Learning With Task Adaptation Pre-training For Speech Emotion Recognition (2024)
- Dawn Of The Transformer Era In Speech Emotion Recognition: Closing The Valence Gap (2022)
- Wav2small: Distilling Wav2vec2 To 72K Parameters For Low-resource Speech Emotion Recognition (2024)
- Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition (2023)
- Vesper: A Compact And Effective Pretrained Model For Speech Emotion Recognition (2023)
- Unsupervised Representations Improve Supervised Learning In Speech Emotion Recognition (2023)