SCaLa: Supervised Contrastive Learning For End-to-end Speech Recognition
2021 · Li Fu, Xiaoxiao Li, Runyu Wang, et al.
Abstract
End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss of the whole token sequence, while neglecting explicit phonemic-granularity supervision. This could result in recognition errors due to similar-phoneme confusion or phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems. Specifically, we extend the self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully-supervised setting, where the supervision is applied in the following way. First, SCaLa masks variable-length encoder features according to phoneme boundaries given phoneme forced-alignment extracted from a pre-trained acoustic model; it then predicts the masked features via contrastive learning. The forced-alignment can provide phoneme labels to mitigate the noise introduced by positive-negative pairs in self-supervised MCPC. Experiments on
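The masking and contrastive steps described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give the loss formula, so the InfoNCE-style loss below, in which candidates sharing the anchor's phoneme label count as positives, is an assumption, as are the function names `phoneme_segments`, `mask_segments`, and `supervised_contrastive_loss`, the zero mask value, and the temperature of 0.1.

```python
import numpy as np


def phoneme_segments(frame_labels):
    """Group consecutive frames with the same phoneme label into
    (start, end, label) spans, with `end` exclusive."""
    segments = []
    start = 0
    for t in range(1, len(frame_labels) + 1):
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            segments.append((start, t, frame_labels[start]))
            start = t
    return segments


def mask_segments(features, segments, mask_ids, mask_value=0.0):
    """Return a copy of `features` (frames x dims) with whole phoneme
    segments overwritten, so mask lengths follow phoneme boundaries."""
    masked = features.copy()
    for i in mask_ids:
        start, end, _ = segments[i]
        masked[start:end] = mask_value
    return masked


def _l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def supervised_contrastive_loss(predicted, targets, pred_labels,
                                target_labels, temperature=0.1):
    """Assumed InfoNCE-style loss: for each predicted vector, targets with
    the same phoneme label are positives and all others are negatives.
    Every anchor needs at least one positive among the targets."""
    p = _l2_normalize(np.asarray(predicted, dtype=float))
    t = _l2_normalize(np.asarray(targets, dtype=float))
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)

    logits = p @ t.T / temperature                  # (anchors, candidates)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    exp = np.exp(logits)

    positive = pred_labels[:, None] == target_labels[None, :]
    pos_sum = (exp * positive).sum(axis=1)
    all_sum = exp.sum(axis=1)
    return float(-np.log(pos_sum / all_sum).mean())
```

In this sketch the frame-level labels would come from the forced alignment, `mask_segments` would be applied to the encoder features before prediction, and the loss would compare the predictions at masked frames against the unmasked features. Using labels to define positives is what removes the mislabeled positive-negative pairs that the abstract attributes to self-supervised MCPC.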
Related papers
- Guided Contrastive Self-supervised Pre-training For Automatic Speech Recognition (2022) · 0.00
- Segmental Contrastive Predictive Coding For Unsupervised Word Segmentation (2021) · 0.00
- Joint Masked CPC And CTC Training For ASR (2020) · 8.60
- Aligned Contrastive Predictive Coding (2021) · 9.23
- Integrating Source-channel And Attention-based Sequence-to-sequence Models For Speech Recognition (2019) · 8.09
- Unsupervised Speech Segmentation And Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding (2021) · 9.92
- Contrastive Prediction Strategies For Unsupervised Segmentation And Categorization Of Phonemes And Words (2021) · 9.23
- Learning Speech Representation From Contrastive Token-acoustic Pretraining (2023) · 7.81