Emotion Detection and Voice-Emotion Conversions Using Deep Learning


Abstract


Emotion, especially when conveyed through speech, carries far more information than text alone can express. Using artificial intelligence to recognize it can benefit a variety of industries, including audio mining, customer service, security, and forensics. Speech emotion recognition (SER), a growing field of research, has relied heavily on models that use audio data to build effective classifiers. This paper presents a convolutional neural network (CNN) that classifies seven emotions with an accuracy of 69.45% on the combined SAVEE, RAVDESS, and TESS datasets. It also proposes a new system for transferring these emotions onto neutral audio (voice conversion). The emotional audio is generated using MelGAN, a type of Generative Adversarial Network (GAN).




Keywords


ASR; CNN; Discriminator; GAN; Generator; Mel; MelGAN; MFCC; MLP; SER; SVM.