Audio-visual Speech Enhancement: Architectural Design And Deployment Strategies
2026 · Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, et al.
Abstract
arXiv:2508.08468v5
Real-time audio-visual speech enhancement (AVSE) is a key enabler for immersive and interactive multimedia services, yet its performance is tightly constrained by network latency, uplink capacity, and computational delay. This paper presents the design, deployment, and evaluation of a complete cloud-edge-assisted AVSE system operating over a public 5G edge network. The system integrates CNN-based acoustic enhancement and OpenCV-based facial feature extraction with an LSTM fusion network to preserve temporal coherence, and is deployed on a Vodafone-compatible AWS Wavelength edge cloud. Through extensive stress testing, we analyze end-to-end performance under varying network load and adaptive multimedia profiles. Results show that compute placement at the network edge is critical for meeting real-time coherence constraints, and that uplink capacity is often the dominant bottleneck for interactive AVSE services. Only 5G and wired Ethern
Related papers
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022) [15.58]
- LSTMSE-Net: Long Short Term Speech Enhancement Network For Audio-visual Speech Enhancement (2024) [8.57]
- Improved Lite Audio-visual Speech Enhancement (2020) [11.39]
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017) [17.39]
- AV2Wav: Diffusion-based Re-synthesis From Continuous Self-supervised Features For Audio-visual Speech Enhancement (2023) [0.00]
- FlowAVSE: Efficient Audio-visual Speech Enhancement With Conditional Flow Matching (2024) [0.00]
- Contextual Audio-visual Switching For Speech Enhancement In Real-world Environments (2018) [14.35]
- An Empirical Study Of Visual Features For DNN Based Audio-visual Speech Enhancement In Multi-talker Environments (2020) [3.58]