Document Clustering Method Based on Association Link Network

doi:10.3969/j.issn.1007-2861.2013.07.003

Journal of Shanghai University (Natural Science Edition), 2014, Vol. 20, Issue 2: 190-198.

HE Xiang (何祥), LUO Xiang-feng (骆祥峰)

(School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China)

Online: 2014-04-26 Published: 2014-04-26
Corresponding author: LUO Xiang-feng (born 1970), male, research fellow, Ph.D.; research interests include massive web information processing, cognitive informatics, and artificial intelligence. E-mail: luoxf@shu.edu.cn
Funding: National Natural Science Foundation of China (61071110)

Abstract

Abstract: This paper proposes an adaptively splitting document clustering method based on the association link network. The method detects the community structures in the association link network and takes each community as a category of the document set, which avoids fixing the number of clusters in advance. In addition, because high-dimensional, sparse word vectors lead to low similarity between documents and between a document and a cluster, the association relations between words in the association link network are mapped onto the relations between documents and clusters, which strengthens those relations. Experimental comparisons with other major clustering methods show that the proposed method not only clusters the document set accurately, but also determines the number of cluster centers fairly accurately and identifies the topics in the document set.

Key words: association link network, community detection, document clustering
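
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: plain word co-occurrence counts stand in for the association link network, connected components of the thresholded word graph stand in for a real community-detection algorithm, and the toy corpus and the names `cooccurrence`, `communities`, `strength`, and `assign` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count how often each unordered word pair appears in the same document."""
    weights = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            weights[(a, b)] += 1
    return weights


def communities(weights, min_weight=2):
    """Connected components of the graph that keeps only edges >= min_weight.

    Assumption: a crude stand-in for community detection on the word network.
    """
    neighbors = {}
    for (a, b), w in weights.items():
        if w >= min_weight:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    seen, result = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        result.append(component)
    return result


def strength(doc, community, weights):
    """Sum of word-to-word association weights between a document and a community."""
    total = 0
    for w in set(doc):
        for v in community:
            if w != v:
                total += weights[tuple(sorted((w, v)))]
    return total


def assign(doc, comms, weights):
    """Index of the community with the strongest association to the document."""
    return max(range(len(comms)), key=lambda i: strength(doc, comms[i], weights))


# Hypothetical toy corpus: two topics plus one cross-topic document.
docs = [
    ["stock", "market", "bank"],
    ["stock", "market", "loan"],
    ["bank", "loan", "market"],
    ["goal", "team", "match"],
    ["goal", "team", "coach"],
    ["match", "coach", "team"],
    ["bank", "team"],
]
weights = cooccurrence(docs)
comms = communities(weights)  # the number of clusters is not supplied anywhere
print(len(comms), [assign(d, comms, weights) for d in (["stock", "loan"], ["bank", "market"])])
```

In this sketch the number of clusters falls out of the word graph rather than being passed in. The documents `["stock", "loan"]` and `["bank", "market"]` share no words, so a plain word-vector similarity between them is zero, yet both are associated with the same community through the word-to-word weights.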

CLC number: TP 391

HE Xiang, LUO Xiang-feng. Document Clustering Method Based on Association Link Network[J]. Journal of Shanghai University (Natural Science Edition), 2014, 20(2): 190-198.
