互联网商品匹配算法

doi:10.3969/j.issn.1007-2861.2015.04.016

上海大学学报(自然科学版) ›› 2016, Vol. 22 ›› Issue (1): 58-68.doi: 10.3969/j.issn.1007-2861.2015.04.016

互联网商品匹配算法

顾颀1,2, 朱灿1, 曹健1

1. 上海交通大学电子信息与电气工程学院, 上海 200240; 2. 南通大学计算机科学与技术学院, 江苏南通 226019

收稿日期:2015-11-30 出版日期:2016-02-29 发布日期:2016-02-29
通讯作者: 曹健(1972—), 男, 教授, 博士生导师, 博士(后), 研究方向为服务计算、网络计算、大数据分析. E-mail: cao-jian@cs.sjtu.edu.cn
基金资助:
国家自然科学基金资助项目(61272438, 61472253, 61300167); 上海市科委资助项目(15411952502, 14511107702)

Product matching based on Internet and its implementation

GU Qi1,2, ZHU Can1, CAO Jian1

1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Computer Science and Technology, Nantong University, Nantong 226019, Jiangsu, China

Received:2015-11-30 Online:2016-02-29 Published:2016-02-29

摘要/Abstract

摘要：

实体解析是指识别同一实体的不同描述形式的过程, 旨在保障数据质量, 是数据清理、数据集成及数据挖掘中的关键技术. 随着电子商务的不断发展和成熟, 商品的多样性和消费者灵活的购买方式, 使得对网络商品的精确识别和匹配成为大数据时代亟待解决的问题. 与传统实体解析主要针对结构化数据不同, 网络数据具有非结构化、异构和海量的特性, 为此设计了综合相似度算法(synthesized similarity method, SSM)来计算网络商品数据间的相似度, 同时引入凝聚的层次聚类框架, 以匹配来自不同数据源的异构商品. 此外, 为了解决大数据环境下对执行效率的要求, 从字符串相似度缓存、约束知识库和分块策略三个方面对SSM进行优化, 基于真实数据集的实验结果验证了SSM的执行效率和有效性.

关键词: 大数据, 非结构化数据, 商品匹配, 实体解析

Abstract:

Entity resolution identifies entities from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration and data mining, and is a key in ensuring data quality. With the rapid growth of E-commerce, diversity of products and flexible buying patterns of consumers, product identification and matching becomes a long-standing research topic in the big data era. While the traditional entity resolution approaches focus on structured data, the Internet data are neither standardized nor structured. In order to address this problem, this paper presents a synthesized similarity method to calculate similarity between different products. An agglomerate hierarchical clustering method is used to identify products from different sources. Also, the approach is optimized to improve efficiency of execution in three aspects: global cache, knowledge constraints, and blocking strategies. Finally, a series of experiments are performed on real data sets. The experimental results show that the proposed approach has a better performance compared with others.

Key words: big data, entity resolution, product matching, unstructured data

顾颀1,2, 朱灿1, 曹健1. 互联网商品匹配算法[J]. 上海大学学报(自然科学版), 2016, 22(1): 58-68.

GU Qi1,2, ZHU Can1, CAO Jian1. Product matching based on Internet and its implementation[J]. Journal of Shanghai University（Natural Science Edition）, 2016, 22(1): 58-68.

参考文献

[1] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey [J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1-16.
[2] Christen P. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection [M]. Berlin: Springer Science and Business Media, 2012.
[3] Christen P. Automatic record linkage using seeded nearest neighbour and support vector machine classification [C]//Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008: 151-159.
[4] Bhattacharya I, Getoor L. Collective entity resolution in relational data [J]. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007, 1(1): 5.
[5] Li P, Dong X, Maurino A, et al. Linking temporal records [J]. Proceedings of the VLDB Endowment, 2011, 4(11): 956-967.
[6] Cohen W W. Integration of heterogeneous databases without common domains using queries based on textual similarity [C]//ACM SIGMOD Record. 1998: 201-212.
[7] Vandic D, Van Dam J W, Frasincar F. Faceted product search powered by the Semantic Web [J]. Decision Support Systems, 2012, 53(3): 425-437.
[8] Dunn H L. Record linkage [J]. American Journal of Public Health and the Nations Health, 1946, 36(12): 1412-1416.

[9] Newcombe H B, Kennedy J M, Axford S J, et al. Automatic linkage of vital records computers can be used to extract “follow-up” statistics of families from files of routine records [J]. Science, 1959, 130(3381): 954-959.
[10] Fellegi I P, Sunter A B. A theory for record linkage [J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[11] Ukkonen E. Approximate string-matching with q-grams and maximal matches [J]. Theoretical Computer Science, 1992, 92(1): 191-211.
[12] Broder A Z, Charikar M, Frieze A M, et al. Min-wise independent permutations [J]. Journal of Computer and System Sciences, 2000, 60(3): 630-659.
[13] McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching [C]//Proceedings of the Sixth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. 2000: 169-178.

互联网商品匹配算法

Product matching based on Internet and its implementation

PDF

可视化

被引次数

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 7

编辑推荐

Metrics

本文评价

[1]	郭毅可1,2, 杨氙2. 精确医学与大数据[J]. 上海大学学报(自然科学版), 2016, 22(1): 17-27.
[2]	廖湘科, 谭郁松, 卢宇彤, 谢旻, 周恩强, 黄杰. 面向大数据应用挑战的超级计算机设计[J]. 上海大学学报(自然科学版), 2016, 22(1): 3-16.
[3]	郭毅可. 走好我们的大数据之路[J]. 上海大学学报(自然科学版), 2016, 22(1): 1-2.
[4]	齐连永1,2, 窦万春1, 周毓明1. 一种上下文感知的E-commerce评级大数据赋权方法[J]. 上海大学学报(自然科学版), 2016, 22(1): 36-44.
[5]	路东方, 许俊富, 项超娟, 谢江. 生物大数据中的聚类方法分析[J]. 上海大学学报(自然科学版), 2016, 22(1): 45-57.
[6]	颜志国, 徐峥, 梅林, 胡传平. 下一代公共安全视频监控系统的大数据分析[J]. 上海大学学报(自然科学版), 2016, 22(1): 81-87.
[7]	宋宇飞, 张俊杰, 王凯, 李家齐, 薛子威, 郑玥. 基于IP协议的FPGA万兆可靠互联通信设计与实现[J]. 上海大学学报(自然科学版), 2015, 21(5): 570-578.