Deep Triphone Embedding Improves Phoneme Recognition
2017 · Mohit Yadav, Vivek Tyagi
Abstract
In this paper, we present a novel Deep Triphone Embedding (DTE) representation, derived from a Deep Neural Network (DNN), that encapsulates the discriminative information present in adjoining speech frames. DTEs are generated in the first stage using a DNN with four hidden layers of 3000 nodes each. This DNN is trained with tied-triphone classification accuracy as the optimization criterion. We then retain the 3000-dimensional activation vector of the last hidden layer for each MFCC speech frame and apply dimensionality reduction to obtain a 300-dimensional representation, which we term the DTE. DTEs, along with the MFCC features, are fed into a second-stage DNN with four hidden layers, which is subsequently trained for tied-triphone classification. Both DNNs are trained using triphone labels generated from a tied-state triphone HMM-GMM system by forced alignment between the transcriptions and the MFCC feature frames. We conduct the experiments on publicly
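The two-stage data flow described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration of the shapes involved, not the authors' implementation: the weights are random and untrained, PCA stands in for the unspecified dimension-reduction step, and the ReLU activation, the 39-dimensional MFCC input, the number of tied states, and the second-stage hidden width are all assumptions. The sizes are scaled down so the example runs quickly; the abstract uses 3000 hidden nodes and a 300-dimensional DTE.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden_dim, n_hidden, out_dim):
    """Random (untrained) weights for an MLP with n_hidden hidden layers."""
    dims = [in_dim] + [hidden_dim] * n_hidden + [out_dim]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Return (softmax posteriors, activations of the last hidden layer)."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU is an assumption
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True), h

def pca_fit(h, k):
    """PCA via SVD; an assumed stand-in for the paper's dimension reduction."""
    mean = h.mean(axis=0)
    _, _, vt = np.linalg.svd(h - mean, full_matrices=False)
    return mean, vt[:k].T  # projection matrix of shape (hidden_dim, k)

# Scaled-down, hypothetical sizes (abstract: HIDDEN = 3000, DTE_DIM = 300).
N_FRAMES, MFCC_DIM, HIDDEN, DTE_DIM, N_STATES = 200, 39, 64, 8, 50

mfcc = rng.standard_normal((N_FRAMES, MFCC_DIM))  # placeholder MFCC frames

# Stage 1: four-hidden-layer DNN; keep the last hidden layer per frame.
stage1 = init_mlp(MFCC_DIM, HIDDEN, 4, N_STATES)
_, last_hidden = forward(stage1, mfcc)

# Reduce the hidden activations to the DTE dimension.
mean, proj = pca_fit(last_hidden, DTE_DIM)
dte = (last_hidden - mean) @ proj

# Stage 2: DTEs concatenated with MFCCs feed a second four-hidden-layer DNN.
stage2_in = np.concatenate([mfcc, dte], axis=1)
stage2 = init_mlp(MFCC_DIM + DTE_DIM, HIDDEN, 4, N_STATES)
posteriors, _ = forward(stage2, stage2_in)

print(dte.shape, stage2_in.shape, posteriors.shape)
```

In an actual system, both networks would be trained on tied-triphone labels obtained by forced alignment before any activations are extracted; the sketch omits training entirely.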
Related papers
- Learning Acoustic Word Embeddings With Phonetically Associated Triplet Network (2018) 0.00
- Bertphone: Phonetically-aware Encoder Representations For Utterance-level Speaker And Language Recognition (2019) 13.27
- Phoneme Based Neural Transducer For Large Vocabulary Speech Recognition (2020) 9.59
- Multilingual And Crosslingual Speech Recognition Using Phonological-vector Based Phone Embeddings (2021) 7.16
- Phonetic Temporal Neural Model For Language Identification (2017) 12.40
- Robust Vocal Quality Feature Embeddings For Dysphonic Voice Detection (2022) 7.16
- Triplet Network With Attention For Speaker Diarization (2018) 7.16
- MGFF-TDNN: A Multi-granularity Feature Fusion TDNN Model With Depth-wise Separable Module For Speaker Verification (2025) 0.00