Journal of Shanghai University (Natural Science Edition), 2016, Vol. 22, Issue (1): 28-35. doi: 10.3969/j.issn.1007-2861.2015.04.020


KNN-based even sampling preprocessing algorithm for big dataset

JI Chengheng, LEI Yongmei   

  1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  • Received: 2015-11-20  Online: 2016-02-29  Published: 2016-02-29

Abstract:

To address the low efficiency and high storage overhead of density-based clustering algorithms, an even data sampling algorithm based on K nearest neighbors (KNN) is proposed as a data preprocessing method for clustering applications. The algorithm slices the dataset and draws samples evenly from it. After slicing, it processes a subset of the samples in descending order of density and removes each one's K nearest neighbors. The remaining samples form the sampled dataset. Experimental results show that, as the data size increases and with accuracy guaranteed, the sampling algorithm effectively improves clustering efficiency by reducing the amount of data that must be clustered.

Key words: K nearest neighbors (KNN), clustering, density descending order, spatial even sampling
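
The sampling procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions the abstract does not specify: density is estimated as the inverse of the mean distance to the K nearest neighbors, slices are formed by sorting on the first coordinate and cutting into equal-sized chunks, and a point that has already been kept is never removed later. The function names `knn_even_sample` and `sample_in_slices` are hypothetical.

```python
import math


def knn_even_sample(points, k=3):
    """Return sorted indices of the points kept within one slice."""
    n = len(points)
    k = min(k, n - 1)
    if k <= 0:
        return list(range(n))
    neighbors, density = [], []
    for i, p in enumerate(points):
        # Brute-force search: the k closest other points (ties broken by index).
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        neighbors.append([j for _, j in nearest])
        # Assumed density estimate: inverse mean distance to the k neighbors.
        mean_d = sum(d for d, _ in nearest) / k
        density.append(1.0 / (mean_d + 1e-12))
    kept, removed = set(), set()
    # Visit points from densest to sparsest; each surviving point is kept
    # and its not-yet-kept K nearest neighbors are removed.
    for i in sorted(range(n), key=lambda idx: -density[idx]):
        if i in removed:
            continue
        kept.add(i)
        removed.update(j for j in neighbors[i] if j not in kept)
    return sorted(kept)


def sample_in_slices(points, n_slices=4, k=3):
    """Slice along the first coordinate (an assumption) and sample each slice."""
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    size = math.ceil(len(order) / n_slices) or 1
    kept = []
    for start in range(0, len(order), size):
        chunk = order[start:start + size]
        local = knn_even_sample([points[i] for i in chunk], k)
        kept.extend(chunk[j] for j in local)  # map back to global indices
    return sorted(kept)


if __name__ == "__main__":
    grid = [(float(x), float(y)) for x in range(10) for y in range(10)]
    sample = sample_in_slices(grid, n_slices=4, k=3)
    print(len(sample), "of", len(grid), "points kept")
```

The brute-force neighbor search is quadratic in the slice size, which is one reason slicing matters in this sketch: each slice is processed independently, so the cost depends on the slice size rather than on the full dataset.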