Back To Supervision: Boosting Word Boundary Detection Through Frame Classification
2024 · Simone Carnemolla, Salvatore Calcagno, Simone Palazzo, et al.
Abstract
Speech segmentation at both word and phoneme levels is crucial for various speech processing tasks. It helps extract meaningful units from an utterance, enabling the generation of discrete elements. In this work, we propose a model-agnostic framework that performs word boundary detection in a supervised manner, combined with a label augmentation technique and an output-frame selection strategy. We trained and tested on the Buckeye dataset and tested only on TIMIT, using state-of-the-art encoder models, including pre-trained solutions (Wav2Vec 2.0 and HuBERT) as well as convolutional and convolutional recurrent networks. With the HuBERT encoder, our method surpasses other state-of-the-art architectures, whether trained in supervised or self-supervised settings on the same datasets. Specifically, we achieved F-values of 0.8427 on the Buckeye dataset and 0.7436 on the TIMIT dataset, along with R-values of 0.8489 and 0.7807, respectively. These
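The F-value and R-value quoted in the abstract are standard boundary-detection scores. As a minimal sketch of how such scores are commonly computed, the snippet below matches hypothesized boundaries to reference boundaries and derives both values from the hit count. The 20 ms tolerance, the greedy one-to-one matching, and the function names `match_boundaries` and `f_and_r_value` are assumptions made for illustration; the abstract does not state the evaluation protocol used by the authors. The R-value follows the usual over-segmentation-aware definition.

```python
import math


def match_boundaries(ref, hyp, tol=0.02):
    """Count hypothesized boundaries that fall within `tol` seconds of a
    reference boundary, using each reference boundary at most once.
    The tolerance value is an assumption, not taken from the paper."""
    ref, hyp = sorted(ref), sorted(hyp)
    hits, i = 0, 0
    for h in hyp:
        # Skip reference boundaries that lie too far to the left of h.
        while i < len(ref) and ref[i] < h - tol:
            i += 1
        if i < len(ref) and abs(ref[i] - h) <= tol:
            hits += 1
            i += 1  # consume the matched reference boundary
    return hits


def f_and_r_value(hits, n_ref, n_hyp):
    """Return (precision, recall, F-value, R-value) from boundary counts."""
    precision = hits / n_hyp if n_hyp else 0.0
    recall = hits / n_ref if n_ref else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    # Over-segmentation: positive when more boundaries are predicted than exist.
    over_seg = recall / precision - 1 if precision else 0.0
    r1 = math.sqrt((1 - recall) ** 2 + over_seg ** 2)
    r2 = (-over_seg + recall - 1) / math.sqrt(2)
    r_value = 1 - (abs(r1) + abs(r2)) / 2
    return precision, recall, f_value, r_value


# Hypothetical boundary times in seconds, for illustration only.
reference = [0.50, 1.00, 1.50, 2.00]
hypothesis = [0.51, 1.03, 1.49, 2.50]
hits = match_boundaries(reference, hypothesis)
precision, recall, f_value, r_value = f_and_r_value(
    hits, len(reference), len(hypothesis)
)
print(f"P={precision:.3f} R={recall:.3f} F={f_value:.3f} R-value={r_value:.3f}")
```

Unlike the F-value, the R-value penalizes over-segmentation, so a detector cannot raise its score simply by predicting more boundaries than the reference contains.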
Related papers
- Self-supervised Contrastive Learning For Unsupervised Phoneme Segmentation (2020)
- Segmental Contrastive Predictive Coding For Unsupervised Word Segmentation (2021)
- Unsupervised Word Discovery: Boundary Detection With Clustering Vs. Dynamic Programming (2024)
- Word Discovery In Visually Grounded, Self-supervised Speech Models (2022)
- Improving Unsupervised Subword Modeling Via Disentangled Speech Representation Learning And Transformation (2019)
- Unsupervised Speech Segmentation And Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding (2021)
- Unsupervised Speech Recognition Via Segmental Empirical Output Distribution Matching (2018)
- Integrating Self-supervised Speech Model With Pseudo Word-level Targets From Visually-grounded Speech Model (2024)