Survey of clustering methods for big data in biology

  • School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Received date: 2015-11-30

Published online: 2016-02-29

Abstract

With the implementation of the Human Genome Project and the rapid development of experimental technologies, biological data are growing sharply and accumulating continuously, and the age of big data in biology is arriving. In the post-genomic era, single statistical models are gradually being replaced by combinations of intelligent and comprehensive analyses. Clustering is a core technique of data mining. This paper describes the state of the art in big-data technology for bioinformatics and summarizes several popular clustering methods for gene expression profiles and biological networks. In addition, experiments comparing different clustering methods on time-series data from mouse embryonic fibroblasts show that different methods produce different results. To reach more reliable conclusions from highly noisy biological data, investigators need to carry out comprehensive analyses by selecting and combining appropriate clustering methods.

Cite this article

LU Dongfang, XU Junfu, XIANG Chaojuan, XIE Jiang. Survey of clustering methods for big data in biology[J]. Journal of Shanghai University, 2016, 22(1): 45-57. DOI: 10.3969/j.issn.1007-2861.2015.04.018

 
