EFFICIENT SPEECH EMOTION RECOGNITION USING LIGHTWEIGHT CNN-LSTM FUSION
Speech Emotion Recognition (SER) applies artificial intelligence to infer human emotional states from audio signals. Speech carries several classes of emotion-relevant information, commonly grouped into continuous features (e.g., pitch and energy), voice-quality features, spectral features, and Teager Energy Operator (TEO)-based features. Selecting discriminative features from the audio is therefore a critical step in recognizing emotions accurately. Much prior work in SER relies on spectral features, particularly Mel-Frequency Cepstral Coefficients (MFCCs), because they capture the spectral patterns of speech effectively. In this study, we combine continuous and spectral features to improve SER performance. These features are fed into a proposed CNN-LSTM model, in which convolutional layers capture local spectral patterns and LSTM layers model temporal dependencies, strengthening the model's ability to recognize emotions in speech. Our model achieves strong performance on the RAVDESS dataset while remaining fast enough for real-time use.
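To illustrate the idea of fusing continuous and spectral features into one frame-level feature matrix, here is a minimal NumPy sketch. It uses short-time energy as the continuous feature and a log-magnitude spectrum as a simple stand-in for MFCCs; the frame length, hop size, and synthetic test signal are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

def extract_features(signal, frame_len=512, hop=256):
    """Fuse a continuous feature (energy) with a spectral feature per frame."""
    # Slice the signal into overlapping frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Continuous feature: short-time energy, one value per frame
    energy = np.sum(frames ** 2, axis=1, keepdims=True)
    # Spectral feature: log-magnitude spectrum of each windowed frame
    # (a stand-in for MFCCs, which would add mel filtering and a DCT)
    spectrum = np.log1p(np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)))
    # Fuse both feature types column-wise: (n_frames, 1 + frame_len // 2 + 1)
    return np.hstack([energy, spectrum])

# Synthetic 1-second, 16 kHz sine tone as a placeholder audio signal
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = extract_features(sig)
print(feats.shape)  # → (61, 258): 61 frames, 1 energy + 257 spectral bins
```

The resulting (frames × features) matrix is the kind of 2-D input a CNN-LSTM model can consume: convolutions scan the feature axis within frames, and the LSTM reads the frame axis as a time series.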