MoWE-Audio: Multitask AudioLLMs With Mixture Of Weak Encoders
2024 · Wenyu Zhang, Shuo Sun, Bin Wang, et al.
Abstract
The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently fine-tuned on specific audio tasks. However, the pre-trained audio encoder has limited capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
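The idea of supplementing a base encoder with a selectively activated pool of weak encoders can be illustrated with a small sketch. The abstract does not specify how encoders are selected or how their outputs are combined, so everything below is an assumption made for illustration: a softmax router with top-k selection, random linear maps standing in for trained encoders, concatenation of base and weak features, and all names (`mowe_features`, `router`, `pool`, `top_k`) are hypothetical rather than taken from the paper.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained encoder (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Matrix-vector product using plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mowe_features(x, base, pool, router, top_k=1):
    """Concatenate base features with a gated mix of the top_k weak encoders.

    The routing rule (softmax gate, top-k, renormalised weights) is an
    illustrative assumption, not the paper's confirmed method.
    """
    base_out = apply_linear(base, x)          # the base encoder always runs
    gate = softmax(apply_linear(router, x))   # one score per weak encoder
    ranked = sorted(range(len(pool)), key=lambda i: gate[i], reverse=True)
    active = ranked[:top_k]
    norm = sum(gate[i] for i in active)
    weak_out = [0.0] * len(pool[0])
    for i in active:                          # only the selected encoders run
        out = apply_linear(pool[i], x)
        weak_out = [acc + (gate[i] / norm) * o for acc, o in zip(weak_out, out)]
    return base_out + weak_out, active


rng = random.Random(0)
IN_DIM, BASE_DIM, WEAK_DIM, POOL_SIZE = 8, 6, 2, 4
base = make_linear(IN_DIM, BASE_DIM, rng)
pool = [make_linear(IN_DIM, WEAK_DIM, rng) for _ in range(POOL_SIZE)]
router = make_linear(IN_DIM, POOL_SIZE, rng)
frame = [rng.uniform(-1, 1) for _ in range(IN_DIM)]  # one toy audio frame

features, active = mowe_features(frame, base, pool, router, top_k=2)
print(len(features), active)
```

In this sketch only `top_k` of the weak encoders are evaluated per input, which is one way the per-input compute can stay small even as the pool grows. Each weak encoder is also much narrower than the base encoder here (`WEAK_DIM` versus `BASE_DIM`), mirroring the "relatively lightweight" description in the abstract.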
Related papers
- UniAudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner (2024) 0.00
- Ideal-LLM: Integrating Dual Encoders And Language-adapted LLM For Multilingual Speech-to-text (2024) 5.24
- WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM (2025) 0.00
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024) 5.24
- Mixture-of-Transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024) 0.00
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59
- Au-m-ol: A Unified Model For Medical Audio And Language Understanding (2026) 0.00
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024) 11.23