SALADnet: Self-attentive Multisource Localization In The Ambisonics Domain
2021 · Pierre-Amaury Grumiaux, Srdan Kitic, Prerak Srivastava, et al.
Abstract
In this work, we propose a novel self-attention-based neural network for robust multi-speaker localization from Ambisonics recordings. Starting from a state-of-the-art convolutional recurrent neural network (CRNN), we investigate the benefit of replacing the recurrent layers with self-attention encoders inherited from the Transformer architecture. We evaluate these models on synthetic and real-world data with up to 3 simultaneous speakers. The results indicate that the majority of the proposed architectures either perform on par with or outperform the CRNN baseline, especially in the multisource scenario. Moreover, by avoiding recurrent layers, the proposed models lend themselves to parallel computing, which is shown to produce considerable savings in execution time.
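As a rough illustration of why self-attention parallelizes where a recurrent layer does not, the following is a minimal sketch of single-head scaled dot-product self-attention over a short sequence of frame features. It is not the SALADnet architecture: the function names, the toy 2-dimensional features, and the identity projection matrices are all assumptions made for illustration, and positional encoding, multiple heads, and learned weights are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (hypothetical helper).

    frames: T x D list of per-frame feature vectors.
    w_q, w_k, w_v: D x D projection matrices (assumed here, not from the paper).
    Returns (outputs, attention_weights).
    """
    q = matmul(frames, w_q)
    k = matmul(frames, w_k)
    v = matmul(frames, w_v)
    d_k = len(k[0])
    # Every frame is scored against every other frame in one matrix product,
    # so no output depends on a previously computed hidden state.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: 3 time frames with 2 features each (made-up values).
identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(frames, identity, identity, identity)
```

Each row of `weights` is a probability distribution over all frames, and each output row is the corresponding weighted average of the value vectors. Because the rows are independent of one another, they can be computed in parallel, whereas a recurrent layer must process frames one after another.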
Related papers
- Self Multi-head Attention For Speaker Recognition (2019) 13.84
- Multi-speaker Localization Using Convolutional Neural Network Trained With Noise (2017) 0.00
- Leveraging Visual Supervision For Array-based Active Speaker Detection And Localization (2023) 6.77
- State-of-the-art Speech Recognition Using Multi-stream Self-attention With Dilated 1D Convolutions (2019) 11.93
- Dilated U-net Based Approach For Multichannel Speech Enhancement From First-order Ambisonics Recordings (2020) 0.00
- Multichannel Long-term Streaming Neural Speech Enhancement For Static And Moving Speakers (2024) 16.05
- Attention-based Neural Beamforming Layers For Multi-channel Speech Recognition (2021) 0.00
- Audio Inputs For Active Speaker Detection And Localization Via Microphone Array (2023) 0.00