Journal of Shanghai University (Natural Science Edition) ›› 2023, Vol. 29 ›› Issue (1): 24-40. doi: 10.12066/j.issn.1007-2861.2332

• Research Articles •

Convolutional speech emotion recognition network based on incremental method

ZHU Yonghua, FENG Tianyu, ZHANG Meixian, ZHANG Wenjun

  1. Shanghai Film Academy, Shanghai University, Shanghai 200072, China
  • Received: 2021-03-20 Online: 2023-02-28 Published: 2023-03-28
  • Contact: ZHANG Wenjun E-mail: wjzhang@shu.edu.cn

Abstract:

A new speech emotion recognition architecture was proposed. Mel-scale frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCCs), chromagrams, Mel-scale spectrograms, Tonnetz representations and spectral contrast features were extracted from sound files and used as the input to a one-dimensional convolutional neural network (CNN). The network consisted of one-dimensional convolutional, Dropout, batch normalization, weighted pooling, fully connected and activation layers. Samples from the RAVDESS (Ryerson audio-visual database of emotional speech and song), EMO-DB (Berlin emotional database) and IEMOCAP (interactive emotional dyadic motion capture) data sets were used to identify emotions. To improve classification accuracy, an incremental method was used to modify the initial model. To enable the network to handle the uneven distribution of emotional information within an utterance automatically, a weighted pooling method based on the attention mechanism was used to generate more effective utterance-level representations. Experimental results showed that the model outperformed existing methods on the RAVDESS and IEMOCAP data sets. On EMO-DB, it had advantages in generality, simplicity and applicability.
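The abstract does not give the formulation of the attention-based weighted pooling, so the following is only a minimal sketch of one common variant, not the authors' implementation. It assumes frame-level features of shape (T, D) and a hypothetical learned scoring vector `w`. Each frame gets a scalar score, a softmax turns the scores into weights, and the utterance-level representation is the weighted sum of the frames. The function name and the random test data are illustrative assumptions.

```python
import numpy as np


def attention_weighted_pooling(frames, w):
    """Pool frame-level features into one utterance-level vector.

    frames: array of shape (T, D), one feature vector per time step.
    w:      array of shape (D,), a hypothetical learned scoring vector.
    Returns (pooled, alphas) with shapes (D,) and (T,).
    """
    scores = frames @ w                      # one scalar score per frame
    scores = scores - scores.max()           # shift for numerical stability
    exp_scores = np.exp(scores)
    alphas = exp_scores / exp_scores.sum()   # softmax weights over time
    pooled = alphas @ frames                 # weighted sum of the frames
    return pooled, alphas


# Illustrative data only: 50 frames of 8-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
w = rng.normal(size=8)
pooled, alphas = attention_weighted_pooling(frames, w)

# With an all-zero scoring vector every frame gets the same weight,
# so the result reduces to ordinary average pooling.
uniform_pooled, uniform_alphas = attention_weighted_pooling(frames, np.zeros(8))
```

Under this sketch, frames with higher scores contribute more to the pooled vector, which is one way a network could emphasize the emotionally salient parts of an utterance instead of averaging all frames equally.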

Key words: speech emotion recognition, convolutional neural network (CNN), attention mechanism
