Multimodal Representation Loss Between Timed Text And Audio For Regularized Speech Separation
2024 · Tsun-An Hsieh, Heeyoul Choi, Minje Kim
Abstract
Recent studies highlight the potential of textual modalities for conditioning the inference process of speech separation models. However, regularization-based methods remain underexplored, even though they require no auxiliary text data at test time. To address this gap, we introduce a timed text-based regularization (TTR) method that uses language model-derived semantics to improve speech separation models. Our approach involves two steps. We begin with two pretrained models, WavLM for audio and BERT for language. Then, a Transformer-based audio summarizer is learned to align the audio and word embeddings and to minimize the gap between them. The summarizer Transformer, incorporated as a regularizer, promotes the alignment of the separated sources with the semantics of the timed text. Experimental results show that the proposed TTR method consistently improves various objective metrics of the separation results over the unregularized baselines.
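As a rough illustration of the regularization idea described in the abstract, the sketch below computes an audio-to-text embedding gap over timed words and adds it to a separation loss. It is a minimal sketch under stated assumptions, not the authors' implementation: mean pooling stands in for the learned summarizer Transformer, cosine distance stands in for the unspecified embedding gap, the frame and word embeddings are plain lists rather than WavLM and BERT outputs, and the names `ttr_regularizer`, `total_loss`, and `weight` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity (an assumed choice of embedding gap)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def mean_pool(frames):
    """Stand-in summarizer: average the frame embeddings of one word span.

    The abstract describes a learned Transformer summarizer; mean pooling
    is swapped in here only to keep the sketch self-contained.
    """
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def ttr_regularizer(audio_frames, word_spans, word_embeddings):
    """Average embedding gap between summarized audio and each timed word.

    audio_frames:    list of frame-level audio embeddings
    word_spans:      list of (start_frame, end_frame) pairs, end exclusive
    word_embeddings: one text embedding per word, same dimension as frames
    """
    gaps = []
    for (start, end), word_vec in zip(word_spans, word_embeddings):
        summary = mean_pool(audio_frames[start:end])
        gaps.append(cosine_distance(summary, word_vec))
    return sum(gaps) / len(gaps)


def total_loss(separation_loss, regularizer, weight=0.1):
    """Separation loss plus weighted regularizer (weight is hypothetical)."""
    return separation_loss + weight * regularizer


# Toy usage: two words, each spanning two frames of 2-D embeddings.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
spans = [(0, 2), (2, 4)]
aligned_words = [[1.0, 0.0], [0.0, 1.0]]
reg = ttr_regularizer(frames, spans, aligned_words)
loss = total_loss(separation_loss=2.0, regularizer=reg, weight=0.5)
```

Because the regularizer only enters the training loss, the timed text is needed during training but not at test time, which matches the advantage the abstract attributes to regularization-based methods.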
Related papers
- Time Domain Audio Visual Speech Separation (2019)
- Enhance Audio Generation Controllability Through Representation Similarity Regularization (2023)
- RTFS-Net: Recurrent Time-frequency Modelling For Efficient Audio-visual Speech Separation (2023)
- On Time Domain Conformer Models For Monaural Speech Separation In Noisy Reverberant Acoustic Environments (2023)
- TF-Locoformer: Transformer With Local Modeling By Convolution For Speech Separation And Enhancement (2024)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- SpeechLM: Enhanced Speech Pre-training With Unpaired Textual Data (2022)
- CTAL: Pre-training Cross-modal Transformer For Audio-and-language Representations (2021)