Transcription And Translation Of Videos Using Fine-tuned XLSR Wav2Vec2 On Custom Dataset And mBART
2024 · Aniket Tathe, Anand Kamble, Suyash Kumbharkar, et al.
Abstract
This research addresses the challenge of training an ASR model for personalized voices with minimal data. Using just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. A Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is then fine-tuned on this dataset. The resulting web-based GUI transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, delivering an accessible solution for transcribing and translating multilingual video content in a personalized voice.
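The abstract does not say how the translated text is aligned with the video timeline. One simple possibility, sketched below purely as an assumption, is to keep the start and end time of each transcribed audio segment and emit the translation of that segment as a subtitle cue in SRT format. The names `Segment`, `to_srt_time`, and `build_srt` are hypothetical, and `translate` is a placeholder callable standing in for whatever wraps the mBART model; nothing here is taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One transcribed span of the video (hypothetical structure)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # source-language transcript for this span


def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(segments: List[Segment], translate: Callable[[str], str]) -> str:
    """Translate each segment and keep its original timing as a subtitle cue.

    `translate` is an assumed callable (for example, a thin wrapper around a
    translation model); it is injected so this sketch runs without any model.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        cue = f"{to_srt_time(seg.start)} --> {to_srt_time(seg.end)}"
        blocks.append(f"{index}\n{cue}\n{translate(seg.text)}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

Because the translation step is passed in as a function, the timing logic can be checked with a stub such as `str.upper` before a real translation model is wired in.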
Related papers
- End To End Hindi To English Speech Conversion Using Bark, mBART And A Finetuned XLSR Wav2Vec2 (2024) 0.00
- Custom Data Augmentation For Low Resource ASR Using Bark And Retrieval-based Voice Conversion (2023) 0.00
- Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022) 11.39
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024) 8.60
- XLAVS-R: Cross-lingual Audio-visual Speech Representation Learning For Noise-robust Speech Perception (2024) 7.50
- Large-scale Multilingual Audio Visual Dubbing (2020) 0.00
- IndicVoices-R: Unlocking A Massive Multilingual Multi-speaker Speech Corpus For Scaling Indian TTS (2024) 2.26
- Enhancing Polyglot Voices By Leveraging Cross-lingual Fine-tuning In Any-to-one Voice Conversion (2024) 0.00