Implementation of Speaker Identification and Speaker Emotion Recognition System

  • Ravi Shankar D., Manjula R. B.
Keywords: Convolutional Neural Network (CNN), Equal Error Rate (EER), Speaker Authentication, MFCC, LSTM.

Abstract

Audio classification poses distinct challenges in speaker recognition and human emotion detection, both of which have direct real-world applications. This paper introduces a multimodal solution to the twin challenges of speaker verification and emotion recognition in a customer-service call centre setting. For speaker recognition, features are extracted as Mel-frequency cepstral coefficients (MFCCs) from a small subset of the LibriSpeech corpus. A three-layer Long Short-Term Memory (LSTM) architecture trained with triplet loss achieves an Equal Error Rate (EER) of 6.89%, demonstrating both efficacy and precision. In parallel, we perform emotion recognition on the RAVDESS dataset using a CNN to classify eight emotions (the six proposed by Ekman, plus neutral and calm), achieving an F1 score of 0.85. These results demonstrate that such deep learning approaches are viable in practice for telephone speaker authentication and call centres, where speaker verification and emotion recognition add context to what is being conveyed.
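
The paper's full method is not reproduced here, but a minimal sketch of the speaker-verification pipeline the abstract describes (MFCC features, a three-layer LSTM embedder trained with triplet loss, and EER evaluation) might look like the following, assuming librosa for feature extraction and PyTorch for the model. The names (`SpeakerEmbedder`, `extract_mfcc`) and all hyperparameters (13 MFCCs, hidden size 128, embedding dimension 64, margin 1.0) are illustrative assumptions, not the authors' settings.

```python
# Sketch of the speaker-verification pipeline from the abstract:
# MFCC features -> 3-layer LSTM embedding -> triplet loss -> EER evaluation.
# Hyperparameters are illustrative assumptions, not the paper's values.
import numpy as np
import librosa
import torch
import torch.nn as nn
from sklearn.metrics import roc_curve

def extract_mfcc(path, n_mfcc=13, sr=16000):
    """Load an utterance and return a (frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.T  # time-major, ready for the LSTM

class SpeakerEmbedder(nn.Module):
    """Three-layer LSTM mapping an MFCC sequence to a unit-norm embedding."""
    def __init__(self, n_mfcc=13, hidden=128, emb_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):                 # x: (batch, frames, n_mfcc)
        _, (h, _) = self.lstm(x)          # h: (num_layers, batch, hidden)
        emb = self.proj(h[-1])            # last layer's final hidden state
        return nn.functional.normalize(emb, dim=1)

model = SpeakerEmbedder()
triplet_loss = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(anchor, positive, negative):
    """One triplet update: anchor/positive share a speaker, negative does not."""
    opt.zero_grad()
    loss = triplet_loss(model(anchor), model(positive), model(negative))
    loss.backward()
    opt.step()
    return loss.item()

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = same-speaker trial
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```

At evaluation time, trial pairs would be scored by cosine similarity of their embeddings (which reduces to a dot product here, since the embeddings are unit-normalized) and passed to `equal_error_rate` along with same/different-speaker labels.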
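Similarly, a minimal sketch of an eight-way CNN emotion classifier for RAVDESS-style audio is shown below. The abstract does not specify the network's input representation or architecture, so the log-mel spectrogram input shape (1 × 64 × 128) and the three-block convolutional design are assumptions for illustration only; the emotion label list follows RAVDESS.

```python
# Sketch of an 8-class emotion CNN for RAVDESS-style spectrogram patches.
# Architecture and input shape (1 x 64 x 128 log-mel) are assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]  # RAVDESS labels

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 64x128 -> 32x64
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 32x64 -> 16x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global average pooling
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                     # x: (batch, 1, 64, 128)
        return self.classifier(self.features(x).flatten(1))

model = EmotionCNN()
logits = model(torch.randn(4, 1, 64, 128))    # dummy batch of 4 spectrograms
preds = [EMOTIONS[i] for i in logits.argmax(dim=1).tolist()]
```

Training such a classifier with cross-entropy loss and reporting a macro-averaged F1 over the eight classes would match the evaluation the abstract reports.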

Published
2024-02-04
Section
Regular Issue