Journal of Shanghai University (Natural Science Edition) ›› 2024, Vol. 30 ›› Issue (3): 476-490. doi: 10.12066/j.issn.1007-2861.2449


Anti-noise speech recognition system based on generative adversarial network data enhancement

FENG Tianyu, ZHU Yonghua

  1. Shanghai Film Academy, Shanghai University, Shanghai 200072, China
  • Online: 2024-06-30 Published: 2024-07-09
  • Corresponding author: ZHU Yonghua (born 1967), male, associate professor, Ph.D.; research interests include network planning and design, high-performance computing, information and communication engineering, and intelligent control. E-mail: zyh@shu.edu.cn

Abstract: Speech recognition research is persistently constrained by the limitations of the available datasets. Data augmentation increases the scale and diversity of the training data and thereby improves recognition accuracy. This paper proposes a speech data generation method based on the generative adversarial network (GAN) to improve speech recognition in noisy conditions. First, a basic GAN structure is used to generate speech samples frame by frame at the spectral feature level. Then, to address the lack of true labels for training, an unsupervised learning framework is proposed that uses untranscribed data for acoustic modeling, and a conditional GAN structure is used to explore two kinds of condition: the acoustic state of each speech frame, and the original clean speech corresponding to the speech in the dataset. A conditional GAN that incorporates this conditional information can directly provide true labels for acoustic modeling. The method was evaluated on two noisy tasks, Aurora-4 and the AMI meeting transcription task. The results show that it significantly improves performance under a variety of noise conditions (additive noise, channel distortion, and reverberation). With the advanced very deep convolutional neural network (VDCNN) acoustic model, the augmented data generated by the GAN reduced the word error rate (WER) by 6% to 14%.

Key words: generative adversarial network, acoustic model, data augmentation, noise, speech recognition
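
The abstract reports its results as word error rate (WER). As a reminder of how that metric is commonly computed, the following is a minimal sketch, not the authors' scoring tool: it assumes whitespace-tokenized transcripts and the usual definition WER = (substitutions + deletions + insertions) / number of reference words, obtained from a word-level edit distance. The function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Assumption: both transcripts are tokenized by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts: one substitution and one deletion
    # against six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition a lower value is better, and WER can exceed 1 when the hypothesis contains many insertions. Production scoring tools usually also normalize the text before alignment, which this sketch omits.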
