This paper proposes a document clustering method with adaptive divisions based on the association link network. Instead of requiring the number of cluster centers to be specified explicitly, as traditional document clustering algorithms do, the method acquires categories automatically by detecting the community structure in the association link network. In addition, because high-dimensional, sparse word vectors result in low similarities between documents, the relationships between words in the association link network are mapped to relationships between documents. Experimental comparisons with other clustering methods show that the proposed method not only achieves high aggregation accuracy, but is also good at adaptively discovering the number of cluster centers and distinguishing categories of topics.
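The idea of obtaining categories from community structure, rather than from a preset number of clusters, can be illustrated with a minimal sketch. It assumes label propagation as the community detector and a toy graph whose edge weights stand in for document-to-document association strengths; the abstract specifies neither, so the function name `label_propagation`, the graph, and the weights are all hypothetical. The sketch also visits nodes in a fixed order and breaks ties by the smallest label so that the result is reproducible, whereas the usual algorithm randomizes both.

```python
from collections import Counter


def label_propagation(adjacency, max_rounds=100):
    """Assign a community label to every node of an undirected weighted graph.

    adjacency: dict mapping node -> dict of neighbour -> edge weight.
    Assumption: deterministic variant (sorted visiting order, smallest label
    wins ties); this is an illustration, not the paper's implementation.
    """
    labels = {node: node for node in adjacency}  # each node starts on its own
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            votes = Counter()
            for neighbour, weight in adjacency[node].items():
                votes[labels[neighbour]] += weight
            if not votes:
                continue  # an isolated node keeps its own label
            best = max(votes.values())
            new_label = min(l for l, v in votes.items() if v == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # labels are stable, so the communities are final
    return labels


def make_toy_graph():
    """Two tightly linked groups of documents joined by one weak link."""
    strong, weak = 1.0, 0.2  # hypothetical association strengths
    edges = [
        ("d0", "d1", strong), ("d0", "d2", strong), ("d1", "d2", strong),
        ("d3", "d4", strong), ("d3", "d5", strong), ("d4", "d5", strong),
        ("d2", "d3", weak),
    ]
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


labels = label_propagation(make_toy_graph())
print(len(set(labels.values())))  # 2 communities on this toy graph
```

Note that the number of clusters is never passed in: it falls out of how the labels settle, which is the adaptive behavior the abstract describes.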
HE Xiang, LUO Xiang-feng. Document Clustering Method Based on Association Link Network[J]. Journal of Shanghai University, 2014, 20(2): 190-198.
DOI: 10.3969/j.issn.1007-2861.2013.07.003