Journal of Shanghai University (Natural Science Edition) ›› 2016, Vol. 22 ›› Issue (1): 58-68. doi: 10.3969/j.issn.1007-2861.2015.04.016

• Big Data •

Internet product matching algorithm

GU Qi (顾颀)1,2, ZHU Can (朱灿)1, CAO Jian (曹健)1

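  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Computer Science and Technology, Nantong University, Nantong 226019, Jiangsu, China
  • Received: 2015-11-30 Publication date: 2016-02-29 Release date: 2016-02-29
  • Corresponding author: CAO Jian (b. 1972), male, professor, doctoral supervisor, Ph.D. (postdoctoral); research interests: service computing, network computing, and big data analytics. E-mail: cao-jian@cs.sjtu.edu.cn
  • Funding:

    National Natural Science Foundation of China (61272438, 61472253, 61300167); Science and Technology Commission of Shanghai Municipality (15411952502, 14511107702)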

Product matching based on Internet and its implementation

GU Qi1,2, ZHU Can1, CAO Jian1   

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Computer Science and Technology, Nantong University, Nantong 226019, Jiangsu, China
  • Received:2015-11-30 Online:2016-02-29 Published:2016-02-29

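Abstract (Chinese-language version):

Entity resolution is the process of identifying different descriptions of the same entity. It aims to ensure data quality and is a key technique in data cleaning, data integration, and data mining. As e-commerce continues to develop and mature, the diversity of products and the flexible ways in which consumers buy them have made the accurate identification and matching of online products a pressing problem in the big data era. Unlike traditional entity resolution, which mainly targets structured data, web data are unstructured, heterogeneous, and massive. To address this, a synthesized similarity method (SSM) is designed to compute the similarity between online product records, and an agglomerative hierarchical clustering framework is introduced to match heterogeneous products from different data sources. In addition, to meet the efficiency requirements of a big data environment, SSM is optimized in three respects: a string similarity cache, a constraint knowledge base, and a blocking strategy. Experimental results on real data sets verify the execution efficiency and effectiveness of SSM.

Keywords: big data, unstructured data, product matching, entity resolution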

Abstract:

Entity resolution identifies records from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration, and data mining, and is key to ensuring data quality. With the rapid growth of e-commerce, the diversity of products, and the flexible buying patterns of consumers, product identification and matching has become a pressing problem in the big data era. While traditional entity resolution approaches focus on structured data, Internet data are neither standardized nor structured. To address this problem, this paper presents a synthesized similarity method (SSM) to calculate the similarity between different products. An agglomerative hierarchical clustering method is used to match products from different sources. The approach is also optimized for execution efficiency in three respects: a cache of string similarities, a knowledge base of constraints, and a blocking strategy. Finally, a series of experiments is performed on real data sets. The experimental results confirm the efficiency and effectiveness of the proposed approach.

Key words: big data, entity resolution, product matching, unstructured data
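
The pipeline outlined in the abstract (pairwise similarity, a similarity cache, blocking, and agglomerative hierarchical clustering) can be illustrated with a minimal Python sketch. It is not the authors' SSM, whose details the abstract does not give. Token-level Jaccard similarity stands in for the synthesized similarity, the first title token serves as a hypothetical blocking key, average linkage and the 0.5 threshold are arbitrary choices, and the constraint knowledge base is omitted. All function names are assumptions made for illustration.

```python
from collections import defaultdict
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)  # plays the role of a string similarity cache
def string_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens (a stand-in, not SSM)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def block_key(title: str) -> str:
    """Hypothetical blocking key: the first token, e.g. a brand name."""
    tokens = title.lower().split()
    return tokens[0] if tokens else ""


def cluster_similarity(c1, c2) -> float:
    """Average-linkage similarity between two clusters of titles."""
    # Sorting each pair makes cache lookups symmetric in (a, b).
    total = sum(string_similarity(*sorted((a, b))) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(titles, threshold=0.5):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[t] for t in titles]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (cluster_similarity(clusters[i], clusters[j]), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def match_products(titles, threshold=0.5):
    """Cluster within each block only, so most cross-block pairs are skipped."""
    blocks = defaultdict(list)
    for title in titles:
        blocks[block_key(title)].append(title)
    matched = []
    for block in blocks.values():
        matched.extend(agglomerate(block, threshold))
    return matched


if __name__ == "__main__":
    offers = [
        "Apple iPhone 6 16GB gold",
        "apple iphone 6 gold 16GB",
        "Apple iPad Air 2",
        "Samsung Galaxy S6",
    ]
    for cluster in match_products(offers):
        print(cluster)
```

Blocking trades recall for speed: two offers of the same product that land in different blocks are never compared, so the choice of blocking key matters as much as the similarity measure.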