Speechformer: Reducing Information Loss In Direct Speech Translation
2021 · Sara Papi, Marco Gaido, Matteo Negri, et al.
Abstract
Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation. However, the Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as-is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. Therefore, potentially useful linguistic information is not accessible to higher-level layers in the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level according to more informed linguistic criteria. Experiments on three language pairs (en→de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and of up to 4.0 BLEU in a low r
Related papers
- Multiformer: A Head-configurable Transformer-based Model For Direct Speech Translation (2022)
- Efficient Speech Translation With Dynamic Latent Perceivers (2022)
- Speechformer: A Hierarchical Efficient Framework Incorporating The Characteristics Of Speech (2022)
- Speechformer++: A Hierarchical Efficient Framework For Paralinguistic Speech Processing (2023)
- Redapt: An Adaptor For Wav2vec 2 Encoding - Faster And Smaller Speech Translation Without Quality Compromise (2022)
- Translatotron 2: High-quality Direct Speech-to-speech Translation With Voice Preservation (2021)
- Implicit Memory Transformer For Computationally Efficient Simultaneous Speech Translation (2023)
- Diffspeaker: Speech-driven 3D Facial Animation With Diffusion Transformer (2024)