CIF-PT: Bridging Speech And Text Representations For Spoken Language Understanding Via Continuous Integrate-and-fire Pre-training
2023 · Linhao Dong, Zhecheng An, Peihao Wu, et al.
Abstract
Speech and text representations generated by pre-trained models contain modality-specific information that can be combined to benefit spoken language understanding (SLU) tasks. In this work, we propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies on a simple but effective frame-to-token alignment mechanism, continuous integrate-and-fire (CIF), to bridge the speech and text representations. During pre-training (PT), it jointly performs speech-to-text training and language model distillation through CIF. Evaluated on the SLURP SLU benchmark, CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy on intent classification and by 2.71% in SLU-F1 on slot filling. We also observe that the cross-modal representation extracted by CIF-PT performs better than other neural interfaces on SLU tasks, including the dominant speech representation learned from self-supervised pre-training.
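The abstract names CIF as the frame-to-token alignment but does not spell out the computation. The following is a minimal sketch of the integrate-and-fire step as it is commonly described, not the authors' implementation. It assumes each frame carries a weight in [0, 1], that the firing threshold is 1.0, and it omits how the weights are predicted and how any leftover weight at the end of the utterance is handled. The function name `cif` and the toy inputs are hypothetical.

```python
def cif(frames, weights, threshold=1.0):
    """Integrate weighted frames and fire one token-level vector each time
    the accumulated weight reaches the threshold.

    Assumes every weight lies in [0, threshold], so a frame fires at most once.
    """
    dim = len(frames[0])
    acc = 0.0                   # weight accumulated since the last firing
    integrated = [0.0] * dim    # weighted sum of frames since the last firing
    fired = []                  # token-level vectors emitted so far

    for h, a in zip(frames, weights):
        if acc + a < threshold:
            # Below threshold: keep integrating this frame.
            acc += a
            integrated = [s + a * x for s, x in zip(integrated, h)]
        else:
            # Threshold reached: split the frame's weight in two parts.
            used = threshold - acc   # part that completes the current token
            rest = a - used          # part that starts the next token
            fired.append([s + used * x for s, x in zip(integrated, h)])
            acc = rest
            integrated = [rest * x for x in h]
    return fired


# Toy example: four 2-dimensional frames whose weights sum to 2.0,
# so two token-level vectors are fired.
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
weights = [0.5, 0.5, 0.25, 0.75]
tokens = cif(frames, weights)
print(tokens)  # [[0.5, 0.5], [3.5, 0.5]]
```

Each fired vector is a weighted sum of consecutive frames, so the output sequence has roughly one vector per token, which is the granularity at which speech features can be compared with text representations.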
Related papers
- SPLAT: Speech-language Joint Pre-training For Spoken Language Understanding (2020)
- SpeechUT: Bridging Speech And Text With Hidden-unit For Encoder-decoder Based Speech-text Pre-training (2022)
- Pre-training For Spoken Language Understanding With Joint Textual And Phonetic Representation Learning (2021)
- SpeechCLIP+: Self-supervised Multi-task Representation Learning For Speech Via CLIP And Speech-image Data (2024)
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding (2020)
- Style Attuned Pre-training And Parameter Efficient Fine-tuning For Spoken Language Understanding (2020)
- On Joint Training With Interfaces For Spoken Language Understanding (2021)
- Understanding Semantics From Speech Through Pre-training (2019)