BERTphone: Phonetically-aware Encoder Representations For Utterance-level Speaker And Language Recognition
2019 · Shaoshi Ling, Julian Salazar, Yuzong Liu, et al.
Abstract
We introduce BERTphone, a Transformer encoder trained on large speech corpora that outputs phonetically-aware contextual representation vectors that can be used for both speaker and language recognition. This is accomplished by training on two objectives: the first, inspired by adapting BERT to the continuous domain, involves masking spans of input frames and reconstructing the whole sequence for acoustic representation learning; the second, inspired by the success of bottleneck features from ASR, is a sequence-level CTC loss applied to phoneme labels for phonetic representation learning. We pretrain two BERTphone models (one on Fisher and one on TED-LIUM) and use them as feature extractors into x-vector-style DNNs for both tasks. We attain a state-of-the-art \(C_{\text{avg}}\) of 6.16 on the challenging LRE07 3sec closed-set language recognition task. On Fisher and VoxCeleb speaker recognition tasks, we see an 18% relative reduction in speaker EER when training on BERTphone vector
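The first objective described above (masking spans of input frames, then reconstructing the whole sequence) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the zero-fill masking, the span length, the masking probability, the L1 reconstruction loss, and the weighted sum with weight `lam` are all assumptions made for illustration. The CTC loss is treated as an already-computed number, not derived here.

```python
import random


def mask_spans(frames, span_len=3, mask_prob=0.15, seed=0):
    """Zero out random spans of frames; return (masked_frames, mask).

    Hypothetical helper: span_len, mask_prob and zero-filling are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    n = len(frames)
    mask = [False] * n
    # Each position may start a span; a started span covers span_len frames.
    for start in range(n):
        if rng.random() < mask_prob / span_len:
            for t in range(start, min(start + span_len, n)):
                mask[t] = True
    masked = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]
    return masked, mask


def l1_reconstruction_loss(predicted, target):
    """Mean absolute error over every frame, masked or not (whole sequence)."""
    total = sum(abs(p - t)
                for pf, tf in zip(predicted, target)
                for p, t in zip(pf, tf))
    count = sum(len(tf) for tf in target)
    return total / count


def combined_loss(recon_loss, ctc_loss, lam=0.5):
    """Assumed multi-task weighting: lam * CTC + (1 - lam) * reconstruction."""
    return lam * ctc_loss + (1.0 - lam) * recon_loss


if __name__ == "__main__":
    # Toy input: 20 frames of 2-dimensional features.
    frames = [[1.0, 2.0] for _ in range(20)]
    masked, mask = mask_spans(frames, span_len=3, mask_prob=0.5, seed=0)
    # Stand-in for an encoder output: here the masked input itself.
    recon = l1_reconstruction_loss(masked, frames)
    print(sum(mask), "frames masked; reconstruction loss:", recon)
    print("combined:", combined_loss(recon, ctc_loss=4.0, lam=0.25))
```

In a real model the encoder would receive the masked frames and its output would be compared against the unmasked input; the sketch only shows how the masking and the two loss terms fit together.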
Related papers
- BERT-LID: Leveraging BERT To Improve Spoken Language Identification (2022) · 8.09
- Speech2phone: A Novel And Efficient Method For Training Speaker Recognition Models (2020) · 2.26
- Learning Disentangled Phone And Speaker Representations In A Semi-supervised VQ-VAE Paradigm (2020) · 8.09
- Phoneme Based Neural Transducer For Large Vocabulary Speech Recognition (2020) · 9.59
- Speaker Embedding Extraction With Phonetic Information (2018) · 11.85
- Length- And Noise-aware Training Techniques For Short-utterance Speaker Recognition (2020) · 0.00
- Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders (2023) · 6.34
- Wav-BERT: Cooperative Acoustic And Linguistic Representation Learning For Low-resource Speech Recognition (2021) · 8.82