Multi-time-scale Convolution For Emotion Recognition From Speech Audio Signals
2020 · Eric Guizzo, Tillman Weyde, Jack Barnett Leveson
Abstract
Robustness against temporal variations is important for emotion recognition from speech audio, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression on the time axis depending on speaker and context. To address this and potentially other tasks, we introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing time-frequency representations of audio data. MTS extends convolutional neural networks with convolution kernels that are scaled and re-sampled along the time axis, to increase temporal flexibility without increasing the number of trainable parameters compared to standard convolutional layers. We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes. The results show that the use of MTS layers consistently improves the generalization of networks of different capacity and
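The core idea in the abstract, one learned kernel re-sampled to several time scales so that the parameter count stays fixed, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the scale factors (0.5, 1.0, 2.0), the use of linear interpolation, the zero padding in time, and the element-wise maximum used to merge the scales are all assumptions chosen for illustration.

```python
def resample_time(kernel, scale):
    """Linearly re-sample a 2D kernel (frequency x time) along the time axis."""
    n_time = len(kernel[0])
    new_len = max(1, round(n_time * scale))
    resampled = []
    for row in kernel:
        new_row = []
        for j in range(new_len):
            # Position of output sample j on the original time grid.
            pos = 0.0 if new_len == 1 else j * (n_time - 1) / (new_len - 1)
            lo = int(pos)
            hi = min(lo + 1, n_time - 1)
            frac = pos - lo
            new_row.append((1 - frac) * row[lo] + frac * row[hi])
        resampled.append(new_row)
    return resampled


def correlate_same_time(spec, kernel):
    """Cross-correlate spec with kernel: zero-padded in time, 'valid' in frequency.

    Padding in time keeps the number of frames fixed, so feature maps
    produced by kernels of different widths can be compared directly.
    """
    k_freq, k_time = len(kernel), len(kernel[0])
    n_freq, n_time = len(spec), len(spec[0])
    pad = (k_time - 1) // 2
    out = []
    for f in range(n_freq - k_freq + 1):
        row = []
        for t in range(n_time):
            acc = 0.0
            for i in range(k_freq):
                for j in range(k_time):
                    tt = t + j - pad
                    if 0 <= tt < n_time:  # samples outside the signal count as zero
                        acc += spec[f + i][tt] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def mts_conv(spec, kernel, scales=(0.5, 1.0, 2.0)):
    """Apply one shared kernel at several time scales and merge the results.

    Every scaled kernel is derived from the same weights, so no trainable
    parameters are added. Merging by element-wise maximum is an assumption.
    """
    maps = [correlate_same_time(spec, resample_time(kernel, s)) for s in scales]
    n_rows, n_cols = len(maps[0]), len(maps[0][0])
    return [[max(m[f][t] for m in maps) for t in range(n_cols)]
            for f in range(n_rows)]


# Toy spectrogram (3 frequency bins x 6 frames) and a 2 x 4 kernel.
spec = [[0, 1, 2, 3, 2, 1],
        [1, 2, 3, 4, 3, 2],
        [0, 1, 0, 1, 0, 1]]
kernel = [[1, 0, -1, 0],
          [0, 1, 0, -1]]
features = mts_conv(spec, kernel)
```

Because the compressed and dilated kernels are computed from the original weights on the fly, a pattern that is spoken faster or slower than in training can still produce a strong response at one of the scales. In a deep-learning framework the same sketch would re-sample the weight tensor before each convolution call rather than loop over samples.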
Related papers
- Multi-channel Auto-encoder For Speech Emotion Recognition (2018) · 0.00
- Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021) · 9.23
- Multi-microphone Speech Emotion Recognition Using The Hierarchical Token-semantic Audio Transformer Architecture (2024) · 5.24
- Temporal-frequency State Space Duality: An Efficient Paradigm For Speech Emotion Recognition (2024) · 7.50
- Emotech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information With Hybrid Recurrent Network (2025) · 8.35
- Capturing Long-term Temporal Dependencies With Convolutional Networks For Continuous Emotion Recognition (2017) · 10.48
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022) · 13.97
- Multi-scale Octave Convolutions For Robust Speech Recognition (2019) · 7.16