Journal of Shanghai University >
Product matching based on Internet and its implementation
Received date: 2015-11-30
Online published: 2016-02-29
Entity resolution identifies entities from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration and data mining, and is a key in ensuring data quality. With the rapid growth of E-commerce, diversity of products and flexible buying patterns of consumers, product identification and matching becomes a long-standing research topic in the big data era. While the traditional entity resolution approaches focus on structured data, the Internet data are neither standardized nor structured. In order to address this problem, this paper presents a synthesized similarity method to calculate similarity between different products. An agglomerate hierarchical clustering method is used to identify products from different sources. Also, the approach is optimized to improve efficiency of execution in three aspects: global cache, knowledge constraints, and blocking strategies. Finally, a series of experiments are performed on real data sets. The experimental results show that the proposed approach has a better performance compared with others.
Key words: big data; entity resolution; product matching; unstructured data
GU Qi1,2, ZHU Can1, CAO Jian1 . Product matching based on Internet and its implementation[J]. Journal of Shanghai University, 2016 , 22(1) : 58 -68 . DOI: 10.3969/j.issn.1007-2861.2015.04.016
[1] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey [J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1-16.
[2] Christen P. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection [M]. Berlin: Springer Science and Business Media, 2012.
[3] Christen P. Automatic record linkage using seeded nearest neighbour and support vector machine classification [C]//Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008: 151-159.
[4] Bhattacharya I, Getoor L. Collective entity resolution in relational data [J]. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007, 1(1): 5.
[5] Li P, Dong X, Maurino A, et al. Linking temporal records [J]. Proceedings of the VLDB Endowment, 2011, 4(11): 956-967.
[6] Cohen W W. Integration of heterogeneous databases without common domains using queries based on textual similarity [C]//ACM SIGMOD Record. 1998: 201-212.
[7] Vandic D, Van Dam J W, Frasincar F. Faceted product search powered by the Semantic Web [J]. Decision Support Systems, 2012, 53(3): 425-437.
[8] Dunn H L. Record linkage [J]. American Journal of Public Health and the Nations Health, 1946, 36(12): 1412-1416.
[9] Newcombe H B, Kennedy J M, Axford S J, et al. Automatic linkage of vital records computers can be used to extract “follow-up” statistics of families from files of routine records [J]. Science, 1959, 130(3381): 954-959.
[10] Fellegi I P, Sunter A B. A theory for record linkage [J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[11] Ukkonen E. Approximate string-matching with q-grams and maximal matches [J]. Theoretical Computer Science, 1992, 92(1): 191-211.
[12] Broder A Z, Charikar M, Frieze A M, et al. Min-wise independent permutations [J]. Journal of Computer and System Sciences, 2000, 60(3): 630-659.
[13] McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching [C]//Proceedings of the Sixth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. 2000: 169-178.
/
| 〈 |
|
〉 |