AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
2025 · Zhonghua Jiang, Kui Chen, Kunxi Li, et al.
Abstract
Recent advances in Audio-Video Large Language Models (AV-LLMs) have improved their performance on tasks such as audio-visual question answering and multimodal dialog systems. Video and audio add an extended temporal dimension, which produces a larger key-value (KV) cache than static image embeddings do. A naive optimization strategy is to selectively retain the KV cache of either audio or video depending on the task. In our experiments, however, we observed that the attention AV-LLMs pay to each modality in the higher layers does not strictly depend on the task: in higher layers, attention shifts toward the video modality. We also found that directly integrating the temporal KV of audio with the spatial-temporal KV of video may cause information confusion and significant performance degradation. If audio and video are processed indiscriminately, one modality may also be excessively compressed or retained, thereby disrupting
Related papers
- Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model (2025)
- Fine-Grained Audio-Visual Joint Representations for Multimodal Large Language Models (2023)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (2024)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners (2024)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition (2024)
- Improved Lite Audio-Visual Speech Enhancement (2020)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models (2025)
- Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding (2024)