Cascaded ASR-Transformer Framework for Audio-Based Hate Speech Detection

Abstract

The growth of voice-based digital communication across social media platforms has highlighted the need for effective detection of hate speech in spoken content. Most existing moderation systems focus on textual data, leaving a significant gap in handling spoken hate speech. This paper presents a multistage, transformer-driven framework for detecting hate speech in audio content. The proposed approach uses a Whisper-based automatic speech recognition (ASR) model to convert the speech signal to text, which is then analyzed by a fine-tuned BERT model for contextual classification. Fine-tuning updates the transformer layers and the classification head to adapt the semantic representations to the target task. The system achieves an accuracy of 83.59%. A custom dataset was curated through controlled audio generation and careful preprocessing to improve linguistic consistency. Experimental results demonstrate stable learning behavior and improved performance on unseen speech samples. The results confirm the suitability of transformer models for detecting hate speech in audio data.
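The two-stage cascade described in the abstract (speech to text, then text classification) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `classify_audio`, `build_hf_stages`, and `Prediction` are hypothetical, the Whisper checkpoint `openai/whisper-small` is an assumed choice, and `path/to/fine-tuned-bert` is a placeholder for a fine-tuned classifier that the abstract does not name. The Hugging Face `transformers` pipelines are assumed here only as one convenient way to load both stages.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Prediction:
    transcript: str   # text produced by the ASR stage
    label: str        # class assigned by the text classifier
    score: float      # classifier confidence for that label


def classify_audio(
    audio_path: str,
    transcribe: Callable[[str], str],
    classify: Callable[[str], Tuple[str, float]],
) -> Prediction:
    """Run the cascade: audio -> transcript -> (label, score)."""
    transcript = transcribe(audio_path).strip()
    label, score = classify(transcript)
    return Prediction(transcript, label, score)


def build_hf_stages(
    asr_model: str = "openai/whisper-small",      # assumed checkpoint
    clf_model: str = "path/to/fine-tuned-bert",   # placeholder, not from the paper
):
    """Build both stages from Hugging Face pipelines (assumed tooling)."""
    # Imported lazily so classify_audio can be used with any other backend.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=asr_model)
    clf = pipeline("text-classification", model=clf_model)

    def transcribe(audio_path: str) -> str:
        # The ASR pipeline returns a dict with a "text" field.
        return asr(audio_path)["text"]

    def classify(text: str) -> Tuple[str, float]:
        # The classifier returns a list with one {"label", "score"} dict;
        # truncation keeps long transcripts within BERT's input limit.
        top = clf(text, truncation=True)[0]
        return top["label"], float(top["score"])

    return transcribe, classify
```

With real models available, usage would look like `transcribe, classify = build_hf_stages(clf_model=...)` followed by `classify_audio("clip.wav", transcribe, classify)`. Keeping the two stages as plain callables means either one can be replaced or stubbed independently, which also makes the effect of transcription errors on classification easier to isolate.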
