Disentangled-Transformer: An Explainable End-to-End Automatic Speech Recognition Model with Speech Content-Context Separation
2024 · Pu Wang, Hugo van Hamme
Abstract
End-to-end transformer-based automatic speech recognition (ASR) systems often capture multiple speech traits in their learned representations that are highly entangled, leading to a lack of interpretability. In this study, we propose the explainable Disentangled-Transformer, which disentangles the internal representations into sub-embeddings with explicit content and speaker traits based on varying temporal resolutions. Experimental results show that the proposed Disentangled-Transformer produces a clear speaker identity, separated from the speech content, for speaker diarization while improving ASR performance.
Related papers
- Self-supervised Disentangled Representation Learning For Robust Target Speech Extraction (2023) 5.24
- Disentangling Voice And Content With Self-supervision For Speaker Recognition (2023) 2.26
- Speaker Disentanglement Of Speech Pre-trained Model Based On Interpretability (2025) 0.00
- ContentVec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022) 0.00
- Unsupervised Learning Of Disentangled Speech Content And Style Representation (2020) 7.50
- Intra-class Variation Reduction Of Speaker Representation In Disentanglement Framework (2020) 8.35
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020) 11.08
- An Effective Mixture-of-experts Approach For Code-switching Speech Recognition Leveraging Encoder Disentanglement (2024) 0.00