End-to-end Speech Recognition And Disfluency Removal With Acoustic Language Model Pretraining
2023 · Saksham Bassi, Giulio Duregon, Siddhartha Jalagam, et al.
Abstract
The SOTA in transcription of disfluent and conversational speech has in recent years favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality and limited availability of large audio datasets inhibited the development of large-scale self-supervised pretraining objectives for learning effective audio representations, giving a relative advantage to the two-stage approach, which utilises pretrained representations for lexical tokens. In light of recent successes in large-scale audio pretraining, we revisit the performance comparison between two-stage and end-to-end models and find that audio-based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models, and fu
Related papers
- Streaming Joint Speech Recognition And Disfluency Detection (2022) 0.00
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023) 0.00
- Improving Hybrid CTC/Attention End-to-end Speech Recognition With Pretrained Acoustic And Language Model (2021) 8.82
- Stutter-solver: End-to-end Multi-lingual Dysfluency Detection (2024) 5.24
- Three-module Modeling For End-to-end Spoken Language Understanding Using Pre-trained DNN-HMM-based Acoustic-phonetic Model (2022) 3.58
- Phonetic And Prosody-aware Self-supervised Learning Approach For Non-native Fluency Scoring (2023) 3.58
- Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning (2021) 2.26
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) 0.00