收稿日期: 2015-11-30
网络出版日期: 2016-02-29
基金资助
国家自然科学基金资助项目(61272438, 61472253, 61300167); 上海市科委资助项目(15411952502, 14511107702)
Product matching based on Internet and its implementation
Received date: 2015-11-30
Online published: 2016-02-29
实体解析是指识别同一实体的不同描述形式的过程, 旨在保障数据质量, 是数据清理、数据集成及数据挖掘中的关键技术. 随着电子商务的不断发展和成熟, 商品的多样性和消费者灵活的购买方式, 使得对网络商品的精确识别和匹配成为大数据时代亟待解决的问题. 与传统实体解析主要针对结构化数据不同, 网络数据具有非结构化、异构和海量的特性, 为此设计了综合相似度算法(synthesized similarity method, SSM)来计算网络商品数据间的相似度, 同时引入凝聚的层次聚类框架, 以匹配来自不同数据源的异构商品. 此外, 为了解决大数据环境下对执行效率的要求, 从字符串相似度缓存、约束知识库和分块策略三个方面对SSM进行优化, 基于真实数据集的实验结果验证了SSM的执行效率和有效性.
顾颀1,2, 朱灿1, 曹健1 . 互联网商品匹配算法[J]. 上海大学学报(自然科学版), 2016 , 22(1) : 58 -68 . DOI: 10.3969/j.issn.1007-2861.2015.04.016
Entity resolution identifies entities from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration and data mining, and is a key in ensuring data quality. With the rapid growth of E-commerce, diversity of products and flexible buying patterns of consumers, product identification and matching becomes a long-standing research topic in the big data era. While the traditional entity resolution approaches focus on structured data, the Internet data are neither standardized nor structured. In order to address this problem, this paper presents a synthesized similarity method to calculate similarity between different products. An agglomerate hierarchical clustering method is used to identify products from different sources. Also, the approach is optimized to improve efficiency of execution in three aspects: global cache, knowledge constraints, and blocking strategies. Finally, a series of experiments are performed on real data sets. The experimental results show that the proposed approach has a better performance compared with others.
Key words: big data; entity resolution; product matching; unstructured data
[1] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey [J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1-16.
[2] Christen P. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection [M]. Berlin: Springer Science and Business Media, 2012.
[3] Christen P. Automatic record linkage using seeded nearest neighbour and support vector machine classification [C]//Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008: 151-159.
[4] Bhattacharya I, Getoor L. Collective entity resolution in relational data [J]. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007, 1(1): 5.
[5] Li P, Dong X, Maurino A, et al. Linking temporal records [J]. Proceedings of the VLDB Endowment, 2011, 4(11): 956-967.
[6] Cohen W W. Integration of heterogeneous databases without common domains using queries based on textual similarity [C]//ACM SIGMOD Record. 1998: 201-212.
[7] Vandic D, Van Dam J W, Frasincar F. Faceted product search powered by the Semantic Web [J]. Decision Support Systems, 2012, 53(3): 425-437.
[8] Dunn H L. Record linkage [J]. American Journal of Public Health and the Nations Health, 1946, 36(12): 1412-1416.
[9] Newcombe H B, Kennedy J M, Axford S J, et al. Automatic linkage of vital records computers can be used to extract “follow-up” statistics of families from files of routine records [J]. Science, 1959, 130(3381): 954-959.
[10] Fellegi I P, Sunter A B. A theory for record linkage [J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[11] Ukkonen E. Approximate string-matching with q-grams and maximal matches [J]. Theoretical Computer Science, 1992, 92(1): 191-211.
[12] Broder A Z, Charikar M, Frieze A M, et al. Min-wise independent permutations [J]. Journal of Computer and System Sciences, 2000, 60(3): 630-659.
[13] McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching [C]//Proceedings of the Sixth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. 2000: 169-178.
/
| 〈 |
|
〉 |