Investigating Lexical Replacements For Arabic-English Code-switched Data Augmentation
2022 · Injy Hamed, Nizar Habash, Slim Abdennadher, et al.
Abstract
Data sparsity is a main problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model results in more natural CS sentences compared to the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform equally (outperforming dictionary-based replacements). Overall, data augmentation achieves 34% improv
Related papers
- Textual Data Augmentation For Arabic-English Code-switching Speech Recognition (2022)
- Data Augmentation For End-to-end Code-switching Speech Recognition (2020)
- The Impact Of Code-switched Synthetic Data Quality Is Task Dependent: Insights From MT And ASR (2025)
- Code-switching Sentence Generation By Generative Adversarial Networks And Its Application To Data Augmentation (2018)
- Acoustic And Textual Data Augmentation For Improved ASR Of Code-switching Speech (2018)
- Improving Low Resource Code-switched ASR Using Augmented Code-switched TTS (2020)
- Speech Collage: Code-switched Audio Generation By Collaging Monolingual Corpora (2023)
- Language-agnostic Code-switching In Sequence-to-sequence Speech Recognition (2022)