Material data named entity recognition based on matching contextual lexical words and graph convolution

陈茜, 武星

Journal of Shanghai University (Natural Science Edition), 2022, Vol. 28, Issue 3: 372-385

DOI: https://doi.org/10.12066/j.issn.1007-2861.2377

Data collection, databases, and data processing

武星 (Wu Xing, b. 1980), male, professor, doctoral supervisor, Ph.D. Research interests: multimodal data mining and machine learning. E-mail: xingwu@shu.edu.cn

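Funding

National Key Research and Development Program of China (2018YFB0704400); Major Science and Technology Project of Yunnan Province (202102AB080019-3); Major Science and Technology Project of Yunnan Province (202002AB080001-2); Key Research Project of Zhejiang Laboratory (2021PE0AC02); Major Project of the Special Development Fund of the Shanghai Zhangjiang National Innovation Demonstration Zone (ZJ2021-ZD-006)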

1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2. Center of Materials Informatics and Data Science, Materials Genome Institute, Shanghai University, Shanghai 200444, China
3. Zhejiang Laboratory, Hangzhou 311100, Zhejiang, China

Received date: 2022-03-15

Online published: 2022-05-27

Cite this article

陈茜, 武星. Material data named entity recognition based on matching contextual lexical words and graph convolution[J]. Journal of Shanghai University (Natural Science Edition), 2022, 28(3): 372-385. DOI: 10.12066/j.issn.1007-2861.2377

Abstract

The materials literature contains a wealth of knowledge, and mining it with machine learning and natural language processing is an active research topic. Named entity recognition (NER) is the first step toward efficiently mining and extracting information from such data. Existing NER methods have two shortcomings: their vector representations cannot resolve polysemy, and the models typically extract contextual features while ignoring global features. To address these problems, a named entity recognition method based on matching contextual lexical words and graph convolution is proposed. First, XLNet is used to obtain the dynamic contextual features of the text. Second, a long short-term memory (LSTM) network and a graph convolutional network (GCN) built on the lexical words matched in the text context extract the contextual features and the global features, respectively. Finally, a conditional random field (CRF) outputs the label sequence. The model is validated on two different corpora. On the materials dataset, the method achieves a precision, recall, and F1 score of 90.05%, 88.67%, and 89.36%, respectively, which effectively improves named entity recognition accuracy.
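The global-feature branch described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify how the graph is built, so the code below assumes that tokens covered by the same matched lexicon entry are fully connected, that neighbouring tokens are linked, and that a single graph convolution of the form ReLU(D^-1/2 A D^-1/2 H W) is applied. The function names (`match_lexicon`, `build_adjacency`, `gcn_layer`), the example sentence, the lexicon, and the random features standing in for XLNet outputs are all hypothetical.

```python
import math
import random


def match_lexicon(tokens, lexicon):
    """Return (start, end) spans, end exclusive, of token n-grams found in the lexicon."""
    max_len = max((len(entry) for entry in lexicon), default=0)
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(len(tokens), i + max_len) + 1):
            if tuple(tokens[i:j]) in lexicon:
                spans.append((i, j))
    return spans


def build_adjacency(n_tokens, spans):
    """Assumed graph: self-loops, links between neighbours, cliques over matched spans."""
    A = [[0.0] * n_tokens for _ in range(n_tokens)]
    for i in range(n_tokens):
        A[i][i] = 1.0
        if i + 1 < n_tokens:
            A[i][i + 1] = A[i + 1][i] = 1.0
    for start, end in spans:
        for i in range(start, end):
            for j in range(start, end):
                A[i][j] = 1.0
    return A


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 A D^-1/2 H W), where A already has self-loops."""
    deg = [sum(row) for row in A]
    n = len(A)
    A_hat = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    Z = matmul(matmul(A_hat, H), W)
    return [[max(z, 0.0) for z in row] for row in Z]


# Hypothetical example sentence and lexicon (not taken from the paper's corpus).
tokens = ["nickel", "based", "superalloy", "shows", "high", "creep", "resistance"]
lexicon = {("nickel", "based", "superalloy"), ("creep", "resistance")}

spans = match_lexicon(tokens, lexicon)
A = build_adjacency(len(tokens), spans)

random.seed(0)
H = [[random.gauss(0, 1) for _ in range(8)] for _ in tokens]  # stand-in for XLNet features
W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]  # untrained layer weights
H_global = gcn_layer(A, H, W)

print(spans)
print(len(H_global), len(H_global[0]))
```

For this sentence the matcher returns the spans `[(0, 3), (5, 7)]`, and the layer yields one 4-dimensional vector per token. In a full model these vectors would be combined with the LSTM outputs before the CRF layer; how the paper combines them is not stated in the abstract.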

Keywords: named entity recognition (NER); XLNet; graph convolutional network (GCN)
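The precision, recall, and F1 score reported in the abstract can be related through the usual harmonic-mean formula. The sketch below assumes entity-level scoring over BIO tag sequences, which is the common convention for NER but is not confirmed by the abstract; the tag names `MAT` and `PRO` and the helper names are hypothetical.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence; end is exclusive."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            entities.add((etype, start, i))
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:
            start, etype = None, None  # "O" or a stray I- tag ends the entity
    return entities


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1: a prediction counts only on exact match."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical tag sequences: the second entity is predicted with the wrong boundary.
gold = ["B-MAT", "I-MAT", "O", "B-PRO", "I-PRO"]
pred = ["B-MAT", "I-MAT", "O", "B-PRO", "O"]
print(prf(gold, pred))

# Consistency check on the percentages quoted in the abstract.
f1_reported_inputs = f1_score(90.05, 88.67)
print(round(f1_reported_inputs, 2))
```

Recomputing F1 from the rounded precision and recall gives about 89.35, which agrees with the reported 89.36% to within the rounding of the two-decimal inputs.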
