Reading Between The Waves: Robust Topic Segmentation Using Inter-sentence Audio Features
2026 · Steffen Freisinger, Philipp Seeberger, Tobias Bocklet, et al.
Abstract
Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
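To make the idea of a Siamese audio encoder around sentence boundaries concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract says both encoders are fine-tuned, whereas this sketch uses small random projections. The dimensions, the mean pooling, the way audio and text features are fused, and every name (`encode_audio`, `boundary_score`, `W_AUDIO`, and so on) are assumptions chosen for illustration, since the abstract does not specify them. The only point it demonstrates is that the same audio weights encode the audio just before and just after a candidate boundary, and the two resulting embeddings are combined with a text embedding to score that boundary.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
AUDIO_DIM, TEXT_DIM, HIDDEN = 6, 4, 3


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols matrix of random values."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# A single audio weight matrix is reused for both sides of the boundary;
# that weight sharing is what makes the audio branch "Siamese".
W_AUDIO = random_matrix(AUDIO_DIM, HIDDEN)
W_TEXT = random_matrix(TEXT_DIM, HIDDEN)
W_OUT = [random.uniform(-1.0, 1.0) for _ in range(3 * HIDDEN)]


def project(vec, weights):
    """Compute tanh(vec @ weights) for a list vector and a list-of-lists matrix."""
    cols = len(weights[0])
    return [
        math.tanh(sum(v * row[j] for v, row in zip(vec, weights)))
        for j in range(cols)
    ]


def encode_audio(frames):
    """Mean-pool audio frames over time, then project with the shared weights."""
    n = len(frames)
    pooled = [sum(frame[i] for frame in frames) / n for i in range(AUDIO_DIM)]
    return project(pooled, W_AUDIO)


def boundary_score(frames_before, frames_after, text_emb):
    """Return a score in (0, 1) for a topic boundary between two sentences."""
    a_prev = encode_audio(frames_before)  # audio just before the boundary
    a_next = encode_audio(frames_after)   # audio just after, same weights
    t = project(text_emb, W_TEXT)         # text representation of the boundary
    # Assumed fusion: absolute difference and product of the two audio
    # embeddings, concatenated with the text embedding.
    diff = [abs(p - q) for p, q in zip(a_prev, a_next)]
    prod = [p * q for p, q in zip(a_prev, a_next)]
    features = diff + prod + t
    logit = sum(f * w for f, w in zip(features, W_OUT))
    return 1.0 / (1.0 + math.exp(-logit))
```

In use, `frames_before` and `frames_after` would be short windows of acoustic feature vectors on either side of a sentence boundary, and `text_emb` a sentence-level embedding from the text encoder. With trained weights, a higher score would indicate a more likely topic change.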
Related papers
- Segmental Audio Word2vec: Representing Utterances As Sequences Of Vectors With Applications In Spoken Term Detection (2018) 11.08
- Topic Identification For Spontaneous Speech: Enriching Audio Features With Embedded Linguistic Information (2023) 4.52
- Smart Speech Segmentation Using Acousto-linguistic Features With Look-ahead (2022) 0.00
- Don't Discard Fixed-window Audio Segmentation In Speech-to-text Translation (2022) 0.00
- Subtitles To Segmentation: Improving Low-resource Speech-to-text Translation Pipelines (2020) 0.00
- Audio Visual Segmentation Through Text Embeddings (2025) 1.81
- Improved Audio Embeddings By Adjacency-based Clustering With Applications In Spoken Term Detection (2018) 0.00
- Toward Unifying Text Segmentation And Long Document Summarization (2022) 8.60