Deep Learning-Based System for Automatic Speech Recognition (ASR) in Noisy Environments

Abstract

This study presents the design and implementation of a deep learning-based automatic speech recognition (ASR) system that operates in noisy environments while also recognizing the emotional state of the speaker. The proposed system uses recurrent neural networks (RNNs) to process sequential acoustic signals, allowing it to transcribe speech and to detect emotional cues such as changes in pitch, tone, and rhythm. Audio signals are preprocessed with noise reduction and feature extraction, such as Mel-Frequency Cepstral Coefficients (MFCCs), to make them clearer before they are passed to the ASR and emotion detection models. Users interact with the system through a Django-based web interface that supports real-time use: recorded speech is processed to produce transcribed text and an emotion classification. Experimental testing shows that the combined system performs well in noisy environments and successfully identifies emotions such as happiness, sadness, anger, and neutrality. This dual-purpose framework highlights the potential of combining ASR and emotion recognition to improve human-computer interaction, assistive technologies, and affective communication models, and to support more human-centered and context-sensitive uses of AI.
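
The MFCC feature-extraction step mentioned above can be sketched as follows. This is a minimal illustration using only NumPy, not the authors' implementation: the abstract gives no parameter values, so the sample rate (16 kHz), frame length (25 ms), hop (10 ms), FFT size (512), filter count (26), number of coefficients (13), and pre-emphasis factor (0.97) are all assumed common defaults, and the function names are hypothetical. Noise reduction is not shown.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    # Unnormalized DCT-II along the last axis, keeping the first n_out terms.
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # All default values here are assumptions, not taken from the paper.
    # 1. Pre-emphasis boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3. Power spectrum of each frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Mel filterbank energies, then log compression.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # 5. DCT decorrelates the log energies; keep the first n_ceps coefficients.
    return dct_ii(log_energies, n_ceps)

# Usage: one second of synthetic noise stands in for recorded speech.
rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(16000))  # shape: (n_frames, n_ceps)
```

Each row of `features` is the feature vector for one frame, so the resulting sequence is the kind of input an RNN would consume one time step at a time.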
