VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering
2025 · Zackary Rackauckas, Julia Hirschberg
Abstract
We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval using cosine similarity over L2-normalized embeddings. We construct a 50-query test set recorded as spoken input by a native English speaker and evaluate retrieval quality using LLM-as-a-judge annotations. For very relevant segments, cosine similarity achieved a Recall@10 of 0.34. For somewhat relevant segments, Recall@10 rose to 0.60 and nDCG@10 to 0.27, highlighting strong topical alignment. Answer quality was judged on a 0-2 scale across relevance, accuracy, completeness, and precision, with mean scores of 0.84, 0.58, 0.56, and 0.46, respectively. While precision and retrieval quality remain key limitations, VoxRAG shows that transcription-free speech-to-speech retrieval is feasible in RAG systems.
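The retrieval and scoring steps named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the two-dimensional vectors stand in for CLAP embeddings, a brute-force inner product over L2-normalized vectors stands in for a FAISS index, and the gain mapping (2 for very relevant, 1 for somewhat relevant) and the Recall@k and nDCG@k definitions are common textbook forms assumed here rather than taken from the paper. All function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def top_k_cosine(query, segments, k):
    """Rank segment indices by cosine similarity to the query.

    After L2 normalization the inner product equals cosine similarity,
    which is why an inner-product index can serve cosine retrieval.
    """
    q = l2_normalize(query)
    scored = []
    for idx, seg in enumerate(segments):
        s = l2_normalize(seg)
        scored.append((sum(a * b for a, b in zip(q, s)), idx))
    # Highest similarity first; ties broken by lower index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [idx for _, idx in scored[:k]]

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant segments that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for idx in ranked[:k] if idx in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, gains, k):
    """nDCG@k with linear gains and a log2 position discount."""
    dcg = sum(gains.get(idx, 0) / math.log2(rank + 2)
              for rank, idx in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (assumed): four "segment embeddings" and one "query embedding".
segments = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.0]]
query = [3.0, 1.0]

ranked = top_k_cosine(query, segments, k=3)
gains = {0: 2, 1: 1}          # hypothetical judge labels per segment index
relevant = set(gains)

print(ranked)
print(recall_at_k(ranked, relevant, k=2))
print(ndcg_at_k(ranked, gains, k=3))
```

In this toy case the somewhat relevant segment is retrieved but ranked below an irrelevant one, so recall within the top 3 is perfect while nDCG is penalized for the ordering. That is the same kind of gap the abstract reports between Recall@10 and nDCG@10.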