TokenVerse: Towards Unifying Speech And NLP Tasks Via Transducer-based ASR
2024 · Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, et al.
Abstract
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
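The central mechanism in the abstract, integrating task-specific tokens into the reference text used for ASR training, can be illustrated with a small sketch. This is not the authors' implementation: the token strings (`[SCD]`, `[ENDP]`, `[NE]`, `[/NE]`), the `Segment` structure, the token ordering, and both helper functions are assumptions made for illustration only. The released repository linked above is the authoritative reference.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical task-token inventory (assumed names, not confirmed by the abstract).
SCD = "[SCD]"        # speaker change detection
ENDP = "[ENDP]"      # endpointing
NE_OPEN = "[NE]"     # start of a named entity
NE_CLOSE = "[/NE]"   # end of a named entity
TASK_TOKENS = {SCD, ENDP, NE_OPEN, NE_CLOSE}


@dataclass
class Segment:
    """One annotated stretch of speech (hypothetical structure)."""
    speaker: str
    words: list[str]
    # Half-open [start, end) word-index ranges marking named entities.
    entity_spans: list[tuple[int, int]] = field(default_factory=list)
    ends_utterance: bool = True


def build_reference(segments: list[Segment]) -> str:
    """Interleave task tokens with the transcript words of each segment."""
    out: list[str] = []
    prev_speaker = None
    for seg in segments:
        # Mark a speaker change between consecutive segments.
        if prev_speaker is not None and seg.speaker != prev_speaker:
            out.append(SCD)
        starts = {s for s, _ in seg.entity_spans}
        ends = {e for _, e in seg.entity_spans}
        for i, word in enumerate(seg.words):
            if i in starts:
                out.append(NE_OPEN)
            out.append(word)
            if i + 1 in ends:
                out.append(NE_CLOSE)
        # Mark the semantic end of the utterance.
        if seg.ends_utterance:
            out.append(ENDP)
        prev_speaker = seg.speaker
    return " ".join(out)


def strip_task_tokens(text: str) -> str:
    """Recover the plain transcript, e.g. for scoring WER on words only."""
    return " ".join(t for t in text.split() if t not in TASK_TOKENS)


segments = [
    Segment("A", ["call", "john", "smith", "today"], entity_spans=[(1, 3)]),
    Segment("B", ["okay"]),
]
reference = build_reference(segments)
print(reference)
print(strip_task_tokens(reference))
```

A model trained on references like this emits the task tokens inline with the words at inference time, so speaker changes, endpoints, and entity boundaries can be read off the decoded output, and the plain transcript is recovered by filtering the tokens out.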
Related papers
- TokenChain: A Discrete Speech Chain Via Semantic Token Modeling (2025)
- Utilizing Neural Transducers For Two-stage Text-to-speech Via Semantic Token Prediction (2024)
- Transduce And Speak: Neural Transducer For Text-to-speech With Semantic Token Prediction (2023)
- TokenSplit: Using Discrete Speech Representations For Direct, Refined, And Transcript-conditioned Speech Separation And Recognition (2023)
- Improving RNN Transducer Based ASR With Auxiliary Tasks (2020)
- TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning For Cross-lingual Speech Recognition (2023)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- Transformer-transducers For Code-switched Speech Recognition (2020)