收稿日期: 2015-11-20
网络出版日期: 2016-02-29
基金资助
上海市教委重点学科资助项目(12ZZ09); 上海市科委资助项目(13DZ118800)
KNN-based even sampling preprocessing algorithm for big dataset
Received date: 2015-11-20
Online published: 2016-02-29
吉成恒, 雷咏梅 . 大规模数据集聚类的K邻近均匀抽样数据预处理算法[J]. 上海大学学报(自然科学版), 2016 , 22(1) : 28 -35 . DOI: 10.3969/j.issn.1007-2861.2015.04.020
To solve the problem of low efficiency and high storage overheads in densitybased clustering algorithms, an algorithm of even data sampling based on K nearest neighbors (KNN) is proposed as a data preprocessing method of clustering applications. The sampling algorithm slices dataset and gets samples evenly. After slicing a dataset, for part of the samples, the algorithm removes each sample’s K nearest neighbors in a descending order according to the density. The remaining samples are then used as the sample dataset. Experimental results show that, with the increase of data size and the guaranteed accuracy, the sampling algorithm can effectively improve efficiency of clustering by reducing the amount of data needed in clustering.
[1] Jain A K, Murty M N, Flynn P J. Data clustering: a review [J]. ACM Computing Surveys,1999, 31(2): 264-323.
[2] 孙吉贵, 刘杰, 赵连宇. 聚类算法研究[J]. 软件学报, 2008, 19(1): 48-61.
[3] Kriegel H P, Kr¨oger P, Ser J, et al. Density-based clustering [J]. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2011, 1(3): 231-240.
[4] Sakai T, Tamura K, Kitakami H. A new density-based spatial clustering algorithm for extracting attractive local regions in georeferenced documents [J]. Lecture Notes in Engineering
and Computer Science, 2014, 2209(1): 360-365.
[5] Ester M, Kriegel H P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise [C]//Proceedings 2nd International Conference on Knowledge Discovery and Data Mining. 1996: 226-231.
[6] Rodriguez A, Laio A. Clustering by fast search and find of density peaks [J]. Science, 2014, 344(6191): 1492-1496.
[7] 胡彩平, 秦小麟.一种改进的基于密度的抽样聚类算法[J].中国图象图形学报, 2007(11): 2031-2036.
[8] 周兵, 沈钧毅, 彭勤科. 基于随机抽样和聚类特征的聚类算法[J]. 西安交通大学学报, 2003(12): 1234-1237.
[9] Liao K, Liu G, Xiao L, et al. A sample-based hierarchical adaptive K-means clustering method for large-scale video retrieval [J]. Knowledge-Based Systems, 2013, 49: 123-133.
[10] Hastie T, Tibshirani R. Discriminant adaptive nearest neighbor classification [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996, 18(6): 607-616.
/
| 〈 |
|
〉 |