Abstract

Speech emotion recognition (SER) has attracted considerable attention in recent years, driven by demand for emotionally intelligent speech interfaces. Deriving speaker-invariant representations is crucial for SER. In this paper, we propose applying adversarial training to SER to learn such representations. Our model consists of three parts: a representation learning sub-network built from a time-delay neural network (TDNN) and an LSTM with statistical pooling, an emotion classification network, and a speaker classification network. Both classification networks take the output of the representation learning network as input. Two training strategies are employed: one based on domain adversarial training (DAT) and the other on cross-gradient training (CGT). Besides the conventional data set, we also evaluate our proposed models on a much larger publicly available emotion data set with 250 speakers. Evaluation results show that on IEMO
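As a rough illustration of the statistical pooling step named in the abstract, the sketch below collapses a variable-length sequence of frame-level feature vectors into one fixed-size utterance vector by concatenating the per-dimension mean and standard deviation. This is the common form of statistics pooling, assumed here for illustration; the abstract does not give the paper's exact formulation. The function name `statistical_pooling` and the use of the population standard deviation are assumptions.

```python
import math


def statistical_pooling(frames):
    """Pool frame-level vectors into one fixed-size utterance vector.

    frames: non-empty list of equal-length lists of floats, one per time step.
    Returns the per-dimension means followed by the per-dimension
    population standard deviations (an assumed convention).
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    # Mean over time for each feature dimension.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation over time for each feature dimension.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Two frames with two features each give a 4-dimensional pooled vector.
pooled = statistical_pooling([[1.0, 2.0], [3.0, 2.0]])
```

The output length is always twice the feature dimension regardless of how many frames the utterance has, which is what lets the emotion and speaker classifiers share a single fixed-size input.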

Authors

(none)

Tags

  • Speech Recognition

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arxiv key: tu2019towards

Related papers