Sentence-wise Speech Summarization: Task, Datasets, And End-to-end Modeling With LM Knowledge Distillation
2024 · Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, et al.
Abstract
This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document sentence by sentence. Sen-SSum combines the real-time processing of automatic speech recognition (ASR) with the conciseness of speech summarization. To explore this approach, we present two datasets for Sen-SSum: Mega-SSum and CSJ-SSum. Using these datasets, we evaluate two types of Transformer-based models: 1) cascade models that combine ASR with strong text summarization models, and 2) end-to-end (E2E) models that convert speech directly into a text summary. Although E2E models are appealing because they can be made compute-efficient, they perform worse than cascade models. We therefore propose knowledge distillation for E2E models using pseudo-summaries generated by the cascade models. Our experiments show that the proposed knowledge distillation effectively improves the performance of the E2E model on both datasets.
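As a rough illustration of the distillation idea in the abstract, the sketch below builds a training set in which a cascade teacher (ASR followed by text summarization) produces pseudo-summaries that serve as targets for an E2E student. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Example`, `cascade_teacher`, and `build_distillation_set` are hypothetical, the ASR and summarization models are passed in as plain callables, and whether the teacher labels the original training audio or additional audio is not stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Audio = Sequence[float]  # placeholder for a waveform or feature sequence


@dataclass
class Example:
    audio: Audio      # input to the E2E student
    summary: str      # target text for the E2E student
    is_pseudo: bool   # True when the target was generated by the teacher


def cascade_teacher(
    audio: Audio,
    asr: Callable[[Audio], str],
    summarize: Callable[[str], str],
) -> str:
    """Hypothetical teacher: transcribe the sentence, then summarize the text."""
    return summarize(asr(audio))


def build_distillation_set(
    labeled: Sequence[Tuple[Audio, str]],
    teacher_audio: Sequence[Audio],
    asr: Callable[[Audio], str],
    summarize: Callable[[str], str],
) -> List[Example]:
    """Combine reference summaries with teacher-generated pseudo-summaries."""
    # Keep the human-written (audio, summary) pairs as they are.
    data = [Example(audio, summary, False) for audio, summary in labeled]
    # Assumption: every item in teacher_audio gets one pseudo-summary.
    for audio in teacher_audio:
        pseudo = cascade_teacher(audio, asr, summarize)
        data.append(Example(audio, pseudo, True))
    return data
```

The E2E model would then be trained on the returned examples with its usual sequence-to-sequence loss; the `is_pseudo` flag is kept only so that reference and pseudo targets can be weighted or filtered separately if desired.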
Related papers
- Transfer Learning From Pre-trained Language Models Improves End-to-end Speech Summarization (2023)
- Speech Summarization Using Restricted Self-attention (2021)
- AugSumm: Towards Generalizable Speech Summarization Using Synthetic Labels From Large Language Model (2024)
- Prompting Large Language Models With Audio For General-purpose Speech Summarization (2024)
- Toward Unifying Text Segmentation And Long Document Summarization (2022)
- Leverage Unlabeled Data For Abstractive Speech Summarization With Self-supervised Learning And Back-summarization (2020)
- Realizing Video Summarization From The Path Of Language-based Semantic Understanding (2024)
- VT-SSum: A Benchmark Dataset For Video Transcript Segmentation And Summarization (2021)