Received: 2022-03-15
Published online: 2022-05-27
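Funding
National Key Research and Development Program of China (2018YFB0704400); Major Science and Technology Project of Yunnan Province (202102AB080019-3); Major Science and Technology Project of Yunnan Province (202002AB080001-2); Zhejiang Lab Key Research Project (2021PE0AC02); Major Project of the Special Development Fund of the Shanghai Zhangjiang National Independent Innovation Demonstration Zone (ZJ2021-ZD-006)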
Material data named entity recognition based on matching contextual lexical words and graph convolution
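陈茜, 武星. Material data named entity recognition based on matching contextual lexical words and graph convolution[J]. Journal of Shanghai University (Natural Science Edition), 2022, 28(3): 372-385. DOI: 10.12066/j.issn.1007-2861.2377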
The materials science literature contains a wealth of knowledge, and mining it with machine learning and natural language processing is an active research topic. Named entity recognition (NER) is the first step in efficiently mining and extracting information from such data. Existing NER methods have two shortcomings: their vector representations cannot handle words with multiple meanings, and their models tend to extract contextual features while ignoring global features. To address these problems, a named entity recognition method based on matching contextual lexical words and graph convolution is proposed. First, the dynamic contextual features of the text are obtained using XLNet. Second, contextual features are obtained using a long short-term memory network, and global features are obtained using a graph convolutional network (GCN) built on the lexical words matched in the text context. Finally, a sequence of labels is output via a conditional random field. The model is validated on two different corpora. On the materials dataset, the method achieves a precision, recall, and F1 score of 90.05%, 88.67%, and 89.36%, respectively, indicating that it effectively improves named entity recognition accuracy.
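The global-feature step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the graph construction (edges between adjacent tokens and between each matched lexical word and the tokens it covers), and all function names are assumptions made for illustration. Only the propagation rule, ReLU(D^{-1/2}(A + I)D^{-1/2} H W), follows the standard GCN formulation. The XLNet, LSTM, and CRF stages are omitted.

```python
import math

def match_lexicon(tokens, lexicon):
    """Return (start, end) spans, end exclusive, whose tokens form a lexicon entry."""
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            if tuple(tokens[i:j]) in lexicon:
                spans.append((i, j))
    return spans

def build_adjacency(n_tokens, spans):
    """Nodes: tokens first, then one node per matched span (assumed graph layout)."""
    n = n_tokens + len(spans)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loops, i.e. A + I
    for i in range(n_tokens - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1.0  # adjacent tokens
    for k, (start, end) in enumerate(spans):
        word = n_tokens + k
        for t in range(start, end):
            adj[word][t] = adj[t][word] = 1.0  # word node to covered tokens
    return adj

def normalize(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj_norm, feats, weight):
    """One graph convolution: ReLU(adj_norm @ feats @ weight)."""
    hidden = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Toy example; the tokens, lexicon, features, and weights are hypothetical.
tokens = ["lithium", "iron", "phosphate", "cathode"]
lexicon = {("lithium", "iron", "phosphate"), ("iron", "phosphate")}
spans = match_lexicon(tokens, lexicon)
adj = build_adjacency(len(tokens), spans)
# Two-dimensional toy features: token nodes vs. matched-word nodes.
feats = [[1.0, 0.0]] * len(tokens) + [[0.0, 1.0]] * len(spans)
weight = [[0.5, -0.2], [0.3, 0.8]]
out = gcn_layer(normalize(adj), feats, weight)
```

In a full model, the rows of `feats` would be learned token and word representations rather than fixed toy vectors, and the token rows of `out` would be combined with the LSTM outputs before the conditional random field layer; how the paper combines them is not specified in the abstract.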