BEST-STD2.0: Balanced And Efficient Speech Tokenizer For Spoken Term Detection
2025 · Anup Singh, Vipul Arora, Kris Demuynck
Abstract
Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation, and with inefficient token utilization. We address these challenges by proposing a noise- and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal-transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.
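The abstract does not spell out how the TF-IDF-based search operates, so the following is only a minimal sketch of the general idea: treat each utterance's token sequence as a "document", weight its token n-grams by TF-IDF, and rank utterances by cosine similarity to the query. Everything here is an assumption made for illustration and not the authors' implementation: the function names (`ngrams`, `normalise`, `build_index`, `search`), the use of bigrams, the smoothed IDF formula, and the toy integer token IDs.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalise(vec):
    """Scale a sparse vector (dict) to unit L2 norm; empty if the norm is zero."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm > 0 else {}


def build_index(docs, n=2):
    """Compute smoothed IDF weights and unit-norm TF-IDF vectors per document."""
    counts = [Counter(ngrams(d, n)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumed variant; other weightings are equally valid).
    idf = {t: math.log((1 + len(docs)) / (1 + k)) + 1.0 for t, k in df.items()}
    vectors = [normalise({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return idf, vectors


def search(query, idf, vectors, n=2):
    """Rank documents by cosine similarity to the query's TF-IDF vector."""
    q_counts = Counter(ngrams(query, n))
    # N-grams never seen in the database carry no evidence and are dropped.
    q = normalise({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    ranking = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Toy database: each utterance is a sequence of made-up discrete token IDs.
docs = [
    [3, 7, 7, 1, 4, 2],
    [5, 5, 9, 0, 8],
    [1, 4, 2, 6, 3, 7],
]
idf, vectors = build_index(docs)
ranking, scores = search([7, 1, 4], idf, vectors)
print(ranking)
```

In this toy run the first utterance shares both query bigrams, the third shares one, and the second shares none, so the utterances are ranked in that order. A practical system would store the vectors in an inverted index so that only utterances sharing at least one n-gram with the query are scored.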
Related papers
- BEST-STD: Bidirectional Mamba-enhanced Speech Tokenization For Spoken Term Detection (2024)
- A Nonparametric Bayesian Approach For Spoken Term Detection By Example Query (2016)
- Query-by-example Spoken Term Detection Using Attention-based Multi-hop Networks (2017)
- Cross-lingual And Multilingual Spoken Term Detection For Low-resource Indian Languages (2020)
- Cross-lingual Query-by-example Spoken Term Detection: A Transformer-based Approach (2024)
- Spoken Term Detection And Relevance Score Estimation Using Dot-product Of Pronunciation Embeddings (2022)
- DASB - Discrete Audio And Speech Benchmark (2024)
- Stabletoken: A Noise-robust Semantic Speech Tokenizer For Resilient Speechllms (2025)