KNN-based even sampling preprocessing algorithm for big dataset

doi:10.3969/j.issn.1007-2861.2015.04.020

Journal of Shanghai University (Natural Science Edition), 2016, Vol. 22, Issue 1: 28-35.

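Corresponding author: LEI Yongmei (born 1965), female, professor, doctoral supervisor, Ph.D.; research interests include high-performance computing and big data processing. E-mail: lei@shu.edu.cn
Funding: Key Discipline Project of the Shanghai Municipal Education Commission (12ZZ09); Project of the Science and Technology Commission of Shanghai Municipality (13DZ118800)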

KNN-based even sampling preprocessing algorithm for big dataset

JI Chengheng, LEI Yongmei

School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Received: 2015-11-20 Online: 2016-02-29 Published: 2016-02-29

Abstract:

To address the low efficiency and high storage overhead of density-based clustering algorithms on large datasets, a sliced, spatially even sampling algorithm based on K nearest neighbors (KNN) is proposed as a data preprocessing step for clustering applications. The algorithm first partitions the dataset into slices. Within each slice, it visits samples in descending order of density and, for part of the samples, removes each sample's K nearest neighbors; the remaining samples form the sampled dataset. This reduces the data size and improves computational efficiency while preserving accuracy. Experimental results show that, for large datasets and with clustering accuracy preserved, the sampling algorithm effectively improves clustering efficiency by reducing the amount of data to be clustered.

Key words: K nearest neighbors (KNN), clustering, density descending order, spatial even sampling
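
The sampling procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how slices are formed, how density is estimated, or which samples have their neighbors removed. The sketch assumes that slices are contiguous chunks along the first coordinate, that density is the inverse of the mean distance to the K nearest neighbors, and that a sample is kept (and its not-yet-kept neighbors dropped) if it has not been dropped earlier. The function names `knn_even_sample` and `knn_even_sample_slice` are hypothetical.

```python
import numpy as np


def knn_even_sample_slice(points, k):
    """Return indices (into `points`) of the samples kept from one slice."""
    n = len(points)
    if n <= k:
        return np.arange(n)  # too few points to thin out
    # Brute-force pairwise distances; acceptable for one slice only.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # K nearest neighbors per point
    knn_dist = np.take_along_axis(dist, neighbors, axis=1)
    # Assumed density estimate: smaller mean K-NN distance = higher density.
    order = np.argsort(knn_dist.mean(axis=1), kind="stable")
    kept = np.zeros(n, dtype=bool)
    removed = np.zeros(n, dtype=bool)
    for i in order:  # visit samples in descending order of density
        if removed[i]:
            continue  # already dropped as a neighbor of a kept sample
        kept[i] = True
        nb = neighbors[i]
        removed[nb[~kept[nb]]] = True  # drop its K neighbors unless already kept
    return np.flatnonzero(kept)


def knn_even_sample(data, k, n_slices):
    """Return sorted indices of the sampled subset of `data` (n_samples x n_dims)."""
    data = np.asarray(data, dtype=float)
    # Assumed slicing rule: contiguous chunks along the first coordinate.
    order = np.argsort(data[:, 0], kind="stable")
    kept = [chunk[knn_even_sample_slice(data[chunk], k)]
            for chunk in np.array_split(order, n_slices)]
    return np.sort(np.concatenate(kept))
```

A density-based clustering algorithm would then be run on `data[knn_even_sample(data, k, n_slices)]` instead of the full dataset. Processing each slice separately keeps the pairwise distance matrix small, at the cost of ignoring neighbors that fall across slice boundaries.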

JI Chengheng, LEI Yongmei. KNN-based even sampling preprocessing algorithm for big dataset[J]. Journal of Shanghai University (Natural Science Edition), 2016, 22(1): 28-35.
