Journal of Shanghai University (Natural Science Edition), 2018, Vol. 24, Issue (5): 730-744. doi: 10.12066/j.issn.1007-2861.1843

• Research Articles •

Application of data provenance in multi-version documents retrieval

CHEN Yue1, DONG Hongbin2, TAN Chengyu1, LIANG Yiwen1

  1. School of Computer Science, Wuhan University, Wuhan 430072, China
  2. International School of Software, Wuhan University, Wuhan 430079, China
  • Received: 2016-09-23 Online: 2018-10-30 Published: 2018-10-26
  • Contact: LIANG Yiwen E-mail: ywliang@whu.edu.cn
  • Funding:
    National Natural Science Foundation of China (61170306); National High Technology Research and Development Program of China (863 Program) (2012AA09A410)

Abstract:

With the spread of personal computers and the arrival of the big data era, the number of document versions stored on personal computers has grown rapidly, and users find it hard to locate the document they need quickly. Related studies show that the provenance information of files gives users cues for quickly locating a target document. Most existing provenance-based retrieval approaches capture provenance at the file level. For documents whose contents are closely related, however, file-level provenance cannot clearly describe the relationships between contents, and therefore does not give users enough help in re-finding documents. Based on the PROV model, this paper establishes a content-level conceptual provenance model for changes between document versions and defines a provenance vocabulary for multi-version document retrieval. The low-level model is described with the resource description framework (RDF), and the high-level model is formed by querying the low-level one, which yields a query and access mechanism for the provenance information. A visualization scheme is also proposed so that the information is presented to users intuitively. The results show that, by using provenance information to expand and explain document retrieval results, the method provides users with more valuable cues, helping them find the target document quickly and work more efficiently.

Key words: multi-version documents, document retrieval, data provenance, PROV

CLC number:
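
The idea summarized in the abstract, content-level provenance stored as RDF-style triples and queried to expand a retrieval hit, can be illustrated with a minimal sketch. This is not the paper's model or vocabulary: the property names `prov:wasDerivedFrom` and `prov:wasGeneratedBy` come from the W3C PROV-O ontology, while every `ex:` identifier, the section-level granularity, and the helper functions `match`, `ancestors`, and `descendants` are hypothetical and chosen only for illustration.

```python
# Minimal sketch, assuming section-level fragments of document versions
# are the provenance entities. All "ex:" names are hypothetical.

TRIPLES = {
    ("ex:report_v1#intro", "rdf:type", "prov:Entity"),
    ("ex:report_v2#intro", "rdf:type", "prov:Entity"),
    ("ex:report_v3#intro", "rdf:type", "prov:Entity"),
    ("ex:slides_v1#summary", "rdf:type", "prov:Entity"),
    ("ex:report_v2#intro", "prov:wasDerivedFrom", "ex:report_v1#intro"),
    ("ex:report_v3#intro", "prov:wasDerivedFrom", "ex:report_v2#intro"),
    ("ex:slides_v1#summary", "prov:wasDerivedFrom", "ex:report_v2#intro"),
    ("ex:report_v2#intro", "prov:wasGeneratedBy", "ex:edit_session_1"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def ancestors(triples, node):
    """Fragments that `node` was derived from, directly or transitively."""
    seen, found, frontier = {node}, [], [node]
    while frontier:
        current = frontier.pop(0)  # breadth-first: nearest sources first
        for _, _, source in match(triples, s=current, p="prov:wasDerivedFrom"):
            if source not in seen:
                seen.add(source)
                found.append(source)
                frontier.append(source)
    return found


def descendants(triples, node):
    """Fragments derived from `node`, directly or transitively."""
    seen, found, frontier = {node}, [], [node]
    while frontier:
        current = frontier.pop(0)
        for derived, _, _ in match(triples, p="prov:wasDerivedFrom", o=current):
            if derived not in seen:
                seen.add(derived)
                found.append(derived)
                frontier.append(derived)
    return found


if __name__ == "__main__":
    hit = "ex:report_v2#intro"  # a fragment returned by an ordinary search
    print("derived from:", ancestors(TRIPLES, hit))
    print("reused in:   ", descendants(TRIPLES, hit))
```

A production system would more likely keep the triples in an RDF store and express the traversal as a SPARQL query; the in-memory set is used here only to keep the sketch self-contained.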