Audio Visual Segmentation Through Text Embeddings
2025 Β· Kyungbok Lee, You Zhang, Zhiyao Duan
Abstract
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt to overcome the challenge of limited data by leveraging the vision foundation model, Segment Anything Model (SAM), prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden on understanding visual modality by utilizing knowledge of pre-trained SAM, it does not address the fundamental challenge of learning audio-visual correspondence with limited data. To address this limitation, we propose \textbf\{AV2T-SAM\}, a novel framework that bridges audio features with the text embedding space of pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel f
Authors
(none)
Tags
Stats
Related papers
- AV-SAM: Segment Anything Model Meets Audio-visual Localization And Segmentation (2023)0.00
- A Study On Joint Modeling And Data Augmentation Of Multi-modalities For Audio-visual Scene Classification (2022)5.24
- AV2AV: Direct Audio-visual Speech To Audio-visual Speech Translation With Unified Audio-visual Speech Representation (2023)6.77
- Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings (2022)8.60
- Avformer: Injecting Vision Into Frozen Speech Models For Zero-shot AV-ASR (2023)7.81
- Avlnet: Learning Audio-visual Language Representations From Instructional Videos (2020)12.87
- Contrastive Conditional Latent Diffusion For Audio-visual Segmentation (2023)10.29
- Connecting The Dots Between Audio And Text Without Parallel Data Through Visual Knowledge Transfer (2021)8.09