Abstract

To protect privacy and comply with legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems, including automatic speech recognition (ASR) and speech translation (ST). However, the FL approach commonly used in S2T tasks (i.e., \textsc{FedAvg}) typically suffers from two problems: extensive communication overhead, because every round of client-server interaction exchanges the whole model, and performance degradation caused by data heterogeneity among clients. To address these issues, we propose a personalized federated S2T framework with two components. \textsc{FedLoRA} is a lightweight LoRA module used for client-side tuning and for interaction with the server, which minimizes communication overhead. \textsc{FedMem} is a global model equipped with a \(k\)-nearest-neighbor (\(k\)NN) classifier that captures client-specific distributional shifts to achieve personalization and overcome data heterogeneity. Extensive experiments based on Conformer and Whisper backbone models on CoVoST
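
The communication argument can be made concrete with a rough sketch. The abstract does not specify the exchange protocol, so everything below is an assumption for illustration: the layer size (1024 x 1024), the LoRA rank (8), and the helper names `full_matrix_params`, `lora_params`, and `fedavg` are hypothetical. The sketch counts how many values a client would send per weight matrix with and without LoRA, and shows a size-weighted average of client updates in the style of FedAvg.

```python
def full_matrix_params(d_out, d_in):
    """Number of values sent when the whole weight matrix is exchanged."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Number of values sent when only the LoRA factors are exchanged.

    LoRA models the update as B @ A, with B of shape (d_out, rank)
    and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def fedavg(client_updates, client_sizes):
    """Average flat parameter lists, weighting each client by its data size."""
    total = sum(client_sizes)
    dim = len(client_updates[0])
    return [
        sum(update[i] * n for update, n in zip(client_updates, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Assumed sizes, chosen only for illustration.
    full = full_matrix_params(1024, 1024)
    lora = lora_params(1024, 1024, rank=8)
    print(f"full matrix: {full} values, LoRA factors: {lora} values")
    print(f"reduction factor: {full // lora}x")

    # Two hypothetical clients holding 1 and 3 units of data.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Note that averaging the factors A and B separately is not equivalent to averaging the products B @ A; how the framework aggregates LoRA modules on the server is not stated in the abstract.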

Authors

(none)

Tags

  • Speech Translation
  • Speech Recognition
  • Text-to-Speech
