Efficient Speech Translation With Dynamic Latent Perceivers
2022 · Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, et al.
Abstract
Transformers have been the dominant architecture for Speech Translation in recent years, achieving significant improvements in translation quality. Since speech signals are longer than their textual counterparts, and due to the quadratic complexity of the Transformer, a down-sampling step is essential for its adoption in Speech Translation. Instead, in this research, we propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. Furthermore, we introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead. Speech-to-Text Perceivers with DLA can match the performance of Transformer baselines across three language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to DLA at inference, and can be flexibly deployed with various computational budgets, without significant drops in translation quality.
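The abstract's two ideas, a fixed-length latent representation and access to only part of a larger latent space at each step, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract does not say how DLA selects latents, so uniform random sampling without replacement is used, and the names `latent_bank`, `cross_attend`, and `encode_with_latent_subset` are hypothetical. The attention is single-head with no learned projections, so this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs):
    """Single-head cross-attention: latents (k, d) query inputs (T, d).

    The score matrix is (k, T), so cost grows linearly with the input
    length T for a fixed k, rather than quadratically as in self-attention.
    """
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # (k, T)
    return softmax(scores, axis=-1) @ inputs   # (k, d)

def encode_with_latent_subset(inputs, latent_bank, k, rng):
    """Encode inputs using only k latents drawn from a larger bank.

    ASSUMPTION: uniform sampling without replacement stands in for the
    selection rule, which the abstract does not specify.
    """
    idx = rng.choice(latent_bank.shape[0], size=k, replace=False)
    return cross_attend(latent_bank[idx], inputs), idx

rng = np.random.default_rng(0)
latent_bank = rng.standard_normal((512, 64))   # large latent space (n = 512)
speech = rng.standard_normal((1000, 64))       # stand-in for speech features (T = 1000)

encoded, idx = encode_with_latent_subset(speech, latent_bank, k=128, rng=rng)
print(encoded.shape)  # (128, 64)
```

In this sketch the per-step cost depends on `k`, not on the size of `latent_bank`, which is one way to read "larger latent spaces without any additional computational overhead". Choosing a different `k` at inference corresponds to deploying the same model under a different computational budget.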
Related papers
- Speechformer: Reducing Information Loss In Direct Speech Translation (2021) · 7.16
- Implicit Memory Transformer For Computationally Efficient Simultaneous Speech Translation (2023) · 0.00
- Dual-decoder Transformer For Joint Automatic Speech Recognition And Multilingual Speech Translation (2020) · 13.73
- Paraformer: Fast And Accurate Parallel Transformer For Non-autoregressive End-to-end Speech Recognition (2022) · 15.10
- Ditto-tts: Diffusion Transformers For Scalable Text-to-speech Without Domain-specific Factors (2024) · 0.00
- Latent Speech-text Transformer (2025) · 3.04
- Multiformer: A Head-configurable Transformer-based Model For Direct Speech Translation (2022) · 0.00
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020) · 11.08