Journal of Shanghai University (Natural Science Edition) ›› 2016, Vol. 22 ›› Issue (1): 58-68. doi: 10.3969/j.issn.1007-2861.2015.04.016

Product matching based on Internet and its implementation

GU Qi^1,2, ZHU Can^1, CAO Jian^1

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Computer Science and Technology, Nantong University, Nantong 226019, Jiangsu, China
  • Received: 2015-11-30 Online: 2016-02-29 Published: 2016-02-29

Abstract:

Entity resolution identifies records from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration, and data mining, and is key to ensuring data quality. With the rapid growth of e-commerce, the diversity of products, and the flexible buying patterns of consumers, product identification and matching has become a long-standing research topic in the big data era. Traditional entity resolution approaches focus on structured data, whereas Internet data are neither standardized nor structured. To address this problem, this paper presents a synthesized similarity method for calculating the similarity between products, and uses agglomerative hierarchical clustering to identify the same product across different sources. The approach is further optimized for execution efficiency in three respects: a global cache, knowledge constraints, and blocking strategies. Finally, a series of experiments is performed on real data sets. The experimental results show that the proposed approach performs better than the other approaches it is compared with.

Key words: big data, entity resolution, product matching, unstructured data
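
The pipeline outlined in the abstract (a combined similarity score, agglomerative hierarchical clustering, and the three efficiency measures) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything concrete in it is an assumption made for illustration: the sample offers, the 0.6/0.4 weights, the 0.5 merge threshold, the use of the brand as blocking key, the rule that two offers from the same shop cannot be the same product (standing in for a "knowledge constraint"), and `functools.lru_cache` standing in for the "global cache".

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical offers (id -> (source, brand, title)); not data from the paper.
OFFERS = {
    "a1": ("shopA", "apple", "Apple iPhone 6 16GB gold"),
    "b1": ("shopB", "apple", "iPhone 6 (16GB, Gold) by Apple"),
    "a2": ("shopA", "apple", "Apple iPad Air 2 64GB silver"),
    "b2": ("shopB", "apple", "Apple iPad Air 2 silver 64GB wifi"),
    "b3": ("shopB", "lenovo", "Lenovo ThinkPad X250 laptop"),
}


def tokens(text):
    """Lower-cased alphanumeric word set of a title."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return frozenset(cleaned.split())


def bigrams(text):
    """Character bigram set of a title, ignoring case and punctuation."""
    s = "".join(c.lower() for c in text if c.isalnum())
    return frozenset(s[i:i + 2] for i in range(len(s) - 1))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@lru_cache(maxsize=None)  # assumed stand-in for a global similarity cache
def _cached_similarity(x, y):
    tx, ty = OFFERS[x][2], OFFERS[y][2]
    # Assumed weights: 0.6 for word overlap, 0.4 for character-bigram overlap.
    return 0.6 * jaccard(tokens(tx), tokens(ty)) + 0.4 * jaccard(bigrams(tx), bigrams(ty))


def similarity(x, y):
    """Symmetric lookup so (x, y) and (y, x) share one cache entry."""
    return _cached_similarity(*sorted((x, y)))


def violates_constraint(c1, c2):
    """Assumed knowledge constraint: one shop lists each product only once."""
    return bool({OFFERS[i][0] for i in c1} & {OFFERS[i][0] for i in c2})


def cluster_block(ids, threshold=0.5):
    """Average-linkage agglomerative clustering within one block."""
    clusters = [[i] for i in ids]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            if violates_constraint(clusters[i], clusters[j]):
                continue
            scores = [similarity(x, y) for x in clusters[i] for y in clusters[j]]
            score = sum(scores) / len(scores)
            if score > best:
                best, pair = score, (i, j)
        if pair is None:  # no admissible pair exceeds the threshold
            break
        i, j = pair
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


def match_offers(offers):
    """Block by brand (assumed key), then cluster each block separately."""
    blocks = {}
    for offer_id, (_source, brand, _title) in offers.items():
        blocks.setdefault(brand, []).append(offer_id)
    result = []
    for ids in blocks.values():
        result.extend(cluster_block(ids))
    return result


if __name__ == "__main__":
    for cluster in match_offers(OFFERS):
        print(cluster)
```

Blocking keeps the quadratic pairwise comparison confined to offers that share a brand, the constraint check skips pairs before any similarity is computed, and the cache avoids recomputing a pair's score on later merge rounds. All three are simplified stand-ins for the optimizations the abstract names.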