Journal of Shanghai University (Natural Science Edition), 2016, Vol. 22, Issue (1): 45-57. doi: 10.3969/j.issn.1007-2861.2015.04.018

• Big Data •

Survey of clustering methods for big data in biology

LU Dongfang (路东方), XU Junfu (许俊富), XIANG Chaojuan (项超娟), XIE Jiang (谢江)

  1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  • Received: 2015-11-30 Online: 2016-02-29 Published: 2016-02-29
  • Corresponding author: XIE Jiang (b. 1971), female, associate professor, Ph.D.; research interests: bioinformatics and high-performance computing. E-mail: jiangxsh@shu.edu.cn
  • Funding:

    Major Research Plan of the National Natural Science Foundation of China (91330116); Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education

Abstract:

With the implementation and completion of the Human Genome Project and the rapid development of biological experimental technology, biological data have grown explosively and continue to accumulate, bringing the life sciences into the era of big data. In the post-genomic era, single statistical models are gradually being replaced by approaches that combine intelligent and comprehensive analysis, and clustering is a core data mining method. This paper describes the current state of big data in bioinformatics and summarizes the clustering methods commonly used in gene expression profile analysis and biological network analysis. Experiments comparing these clustering methods on time-series data from mouse embryonic fibroblasts show that different methods produce different results. For highly noisy biological big data, selecting or combining appropriate clustering methods for comprehensive analysis helps to obtain more reliable results.

Key words: big data in biology, clustering method, data analysis
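
The abstract's central observation, that different clustering methods can partition the same data differently, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the six three-point "expression profiles" are invented toy data, and k-means and single-linkage hierarchical clustering are chosen only as two familiar methods. The abstract does not name the methods the paper compares, and this is not the authors' experiment.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, init_indices, max_iter=100):
    """Plain k-means with a deterministic start (hypothetical helper)."""
    centroids = [points[i] for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assign every profile to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def single_linkage(points, k):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                euclidean(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe because a < b
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Invented toy profiles (three time points each), not data from the paper.
# The profiles form a chain with one slightly larger gap before the last one.
profiles = [(a, 2 * a, a) for a in (0.0, 1.0, 2.0, 3.0, 4.0, 5.5)]

kmeans_labels = kmeans(profiles, init_indices=[0, len(profiles) - 1])
linkage_labels = single_linkage(profiles, k=2)

print("k-means:       ", kmeans_labels)   # [0, 0, 0, 1, 1, 1]
print("single linkage:", linkage_labels)  # [0, 0, 0, 0, 0, 1]
```

On this chain-shaped toy data, k-means splits the profiles near the middle, while single linkage isolates only the last profile. Neither answer is wrong under its own criterion, which is why the abstract recommends selecting or combining methods rather than relying on one.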