Multi-view Multi-task Representation Learning For Mispronunciation Detection
2023 · Yassine El Kheir, Shammur Absar Chowdhury, Ahmed Ali
Abstract
The disparity in phonology between a learner's native (L1) and target (L2) languages poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by the lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple 'views' of the same input data, assisted by auxiliary tasks, to learn more distinctive phonetic representations in a low-resource setting. Using mono- and multilingual encoders, the model learns multiple views of the input and captures sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our results on the L2-ARCTIC data outperform the state-of-the-art (SOTA) models, with phoneme error rate reductions of 11.13% and 8.60% and absolute F1 score increases of 5.89% and 2.49% compared to the single-view mono- and multilingual systems, respectively, using a limited L2 dataset.
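The two metrics quoted above can be illustrated with a minimal sketch. It assumes that phoneme error rate (PER) is the Levenshtein distance between the reference and predicted phoneme sequences divided by the reference length, and that F1 is computed from detection counts with a mispronounced phoneme as the positive class. The function names and the toy ARPAbet sequences are illustrative and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from detection counts (assumed positive class: mispronounced)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: one substitution (DH -> D) and one deletion (T).
reference = ["DH", "AH", "K", "AE", "T"]
predicted = ["D", "AH", "K", "AE"]
per = phoneme_error_rate(reference, predicted)  # 2 edits / 5 phonemes
f1 = f1_score(tp=8, fp=2, fn=2)                 # 16 / 20
```

In this toy case the PER is 2/5 and the F1 score is 16/20. The counts passed to `f1_score` are made up for illustration and do not correspond to any reported result.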
Related papers
- A Full Text-dependent End To End Mispronunciation Detection And Diagnosis With Easy Data Augmentation Techniques (2021)
- Improving Mispronunciation Detection With Wav2vec2-based Momentum Pseudo-labeling For Accentedness And Intelligibility Assessment (2022)
- CoCA-MDD: A Coupled Cross-attention Based Framework For Streaming Mispronunciation Detection And Diagnosis (2021)
- Improving End-to-end Modeling For Mispronunciation Detection With Effective Augmentation Mechanisms (2021)
- SpeechBlender: Speech Augmentation Framework For Mispronunciation Data Generation (2022)
- Speaker-independent Acoustic-to-articulatory Inversion Through Multi-channel Attention Discriminator (2024)
- E2E-based Multi-task Learning Approach To Joint Speech And Accent Recognition (2021)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)