Journal of Shanghai University (Natural Science Edition) ›› 2023, Vol. 29 ›› Issue (1): 24-40. doi: 10.12066/j.issn.1007-2861.2332

• Research Article •

  • About the author: ZHANG Wenjun (1959—), male, professor, doctoral supervisor, Ph.D. His research interests include digital media technology and applications, new digital media, network communication technology, and computational electromagnetics. E-mail: wjzhang@shu.edu.cn

Convolutional speech emotion recognition network based on incremental method

ZHU Yonghua, FENG Tianyu, ZHANG Meixian, ZHANG Wenjun()   

  1. Shanghai Film Academy, Shanghai University, Shanghai 200072, China
  • Received:2021-03-20 Online:2023-02-28 Published:2023-03-28
  • Contact: ZHANG Wenjun E-mail:wjzhang@shu.edu.cn


Abstract:

A novel speech emotion recognition structure was proposed, which extracted Mel-scale frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), chromagrams, Mel-scale spectrograms, Tonnetz representations, and spectral contrast features from sound files and used them as the input of a one-dimensional convolutional neural network (CNN). A network was constructed consisting of one-dimensional convolutional layers, Dropout layers, batch normalization layers, a weighted pooling layer, fully connected layers, and activation layers, and emotions were identified on samples from three data sets: the Ryerson audio-visual database of emotional speech and song (RAVDESS), the Berlin emotional database (EMO-DB), and the interactive emotional dyadic motion capture (IEMOCAP) data set. To improve classification accuracy, an incremental method was used to modify the initial model. To enable the network to automatically handle the uneven distribution of emotional information within an utterance, an attention-based weighted pooling method was used to generate more effective utterance-level representations. Experimental results showed that the model outperformed existing methods on the RAVDESS and IEMOCAP data sets; on EMO-DB it was second only to one baseline method, while retaining advantages in versatility, simplicity, and applicability.

Key words: speech emotion recognition, convolutional neural network (CNN), attention mechanism
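The attention-based weighted pooling mentioned in the abstract can be illustrated with a generic formulation: a learnable vector scores each frame, a softmax turns the scores into weights, and the weighted sum of frames gives the utterance-level representation. The scoring vector `u` below is a placeholder for whatever parameterization the paper actually trains; this is a sketch of the mechanism, not the authors' exact layer.

```python
import numpy as np

def attention_pool(H, u):
    """Attention-weighted pooling over time.
    H: (T, d) frame-level features; u: (d,) learnable scoring vector.
    Frames that score higher under u contribute more to the pooled
    utterance-level vector, so informative frames dominate."""
    scores = H @ u                                   # (T,) unnormalized scores
    scores = scores - scores.max()                   # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()    # (T,) attention weights
    return alpha @ H                                 # (d,) pooled representation
```

With a zero scoring vector every frame receives equal weight and the pooling degenerates to a simple mean over time, which is why the learned scores are what let the network down-weight frames carrying little emotional information.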

CLC number: