Product matching based on Internet and its implementation

  • 1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Computer Science and Technology, Nantong University, Nantong 226019, Jiangsu, China

Received date: 2015-11-30

Published online: 2016-02-29

Abstract

Entity resolution identifies records from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration, and data mining, and is key to ensuring data quality. With the rapid growth of e-commerce, the diversity of products, and the flexible buying patterns of consumers, product identification and matching have become an important research topic in the big data era. Traditional entity resolution approaches focus on structured data, whereas Internet data are neither standardized nor structured. To address this problem, this paper presents a synthesized similarity method for calculating the similarity between products. An agglomerative hierarchical clustering method is then used to identify identical products from different sources. The approach is also optimized for execution efficiency in three respects: a global cache, knowledge constraints, and blocking strategies. Finally, a series of experiments is performed on real data sets. The experimental results show that the proposed approach performs better than the other approaches it is compared against.
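The pipeline outlined in the abstract (a similarity measure, blocking, and agglomerative hierarchical clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: token-level Jaccard similarity on product titles stands in for the synthesized similarity method, the first title token is used as a hypothetical blocking key, average linkage is one possible merge criterion, and the small per-block dictionary only hints at the idea of caching pairwise similarities. Knowledge constraints are omitted.

```python
from itertools import combinations


def tokens(title):
    """Lower-cased whitespace tokens of a product title."""
    return set(title.lower().split())


def jaccard(a, b):
    """Token Jaccard similarity (stand-in for the paper's synthesized similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def block_key(title):
    """Hypothetical blocking rule: group titles by their first token."""
    parts = title.lower().split()
    return parts[0] if parts else ""


def cluster_block(titles, threshold):
    """Average-linkage agglomerative clustering within one block."""
    clusters = [[t] for t in titles]
    cache = {}  # pairwise similarities, computed once per pair

    def sim(a, b):
        key = (a, b) if a <= b else (b, a)
        if key not in cache:
            cache[key] = jaccard(a, b)
        return cache[key]

    def linkage(c1, c2):
        total = sum(sim(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, (i, j) = max(
            (linkage(clusters[i], clusters[j]), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


def match_products(titles, threshold=0.5):
    """Block the titles, then cluster each block independently."""
    blocks = {}
    for title in titles:
        blocks.setdefault(block_key(title), []).append(title)
    groups = []
    for block in blocks.values():
        groups.extend(cluster_block(block, threshold))
    return groups
```

Blocking keeps the number of pairwise comparisons down because only titles that share a block key are ever compared; the threshold of 0.5 is an arbitrary illustrative value and would need tuning on real data.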

Cite this article

GU Qi1,2, ZHU Can1, CAO Jian1. Product matching based on Internet and its implementation[J]. Journal of Shanghai University, 2016, 22(1): 58-68. DOI: 10.3969/j.issn.1007-2861.2015.04.016

