Phone-to-audio Alignment Without Text: A Semi-supervised Approach
2021 · Jian Zhu, Cong Zhang, David Jurgens
Abstract
Phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for text-dependent and text-independent phone-to-audio alignment. The first, Wav2Vec2-FS, is a semi-supervised model that learns phone-to-audio alignment directly through contrastive learning and a forward-sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The second, Wav2Vec2-FC, is a frame classification model trained on force-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, produce results that closely match those of existing forced alignment tools. Our work presents a fully automated neural pipeline for phone-to-audio alignment. Code and pretrained models are available at https://github.com/lingjzhu/charsiu.
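To make the idea of text-independent segmentation from a frame classifier concrete, the sketch below collapses a sequence of per-frame phone labels into timed segments. It is an illustration only, not the authors' implementation: the function name `frames_to_segments`, the example labels, and the 20 ms frame duration are all assumptions chosen for the example.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_duration=0.02):
    """Collapse per-frame phone labels into (phone, start_sec, end_sec) segments.

    frame_duration is an assumed value (20 ms per frame), not one taken
    from the paper; set it to match the model's actual frame rate.
    """
    segments = []
    index = 0  # index of the first frame in the current run
    for phone, run in groupby(frame_labels):
        length = len(list(run))
        start = index * frame_duration
        end = (index + length) * frame_duration
        # Round to avoid floating-point noise in the reported times.
        segments.append((phone, round(start, 4), round(end, 4)))
        index += length
    return segments


# Hypothetical frame-level predictions, e.g. the argmax of a classifier's outputs.
frames = ["sil", "sil", "HH", "HH", "HH", "AH", "AH", "sil"]
segments = frames_to_segments(frames)
for phone, start, end in segments:
    print(f"{phone}\t{start:.2f}\t{end:.2f}")
```

Merging consecutive identical labels is the simplest possible decoding; a real system may smooth the predictions or constrain them with a known phone sequence when a transcription is available.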
Related papers
- TIPAA-SSL: Text Independent Phone-to-audio Alignment Based On Self-supervised Learning And Knowledge Transfer (2024)
- Wav2vec: Unsupervised Pre-training For Speech Recognition (2019)
- Improving Sequence-to-sequence Acoustic Modeling By Adding Text-supervision (2018)
- High-quality Automatic Voice Over With Accurate Alignment: Supervision Through Self-supervised Discrete Speech Units (2023)
- Learning Disentangled Phone And Speaker Representations In A Semi-supervised VQ-VAE Paradigm (2020)
- Transduce And Speak: Neural Transducer For Text-to-speech With Semantic Token Prediction (2023)
- Unsupervised Speech Segmentation And Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding (2021)
- CCC-wav2vec 2.0: Clustering Aided Cross Contrastive Self-supervised Learning Of Speech Representations (2022)