
Table of Contents

    29 February 2016, Volume 22 Issue 1
    Let's go big data
    GUO Yike
    2016, 22(1):  1-2.  doi:10.3969/j.issn.1007-2861.2015.05.016
    Abstract ( 627 )   PDF (1502KB) ( 396 )
    On the challenge for supercomputer design in the big data era
    LIAO Xiangke, TAN Yusong, LU Yutong, XIE Min, ZHOU Enqiang, HUANG Jie
    2016, 22(1):  3-16.  doi:10.3969/j.issn.1007-2861.2015.03.014
    Abstract ( 810 )   PDF (7213KB) ( 604 )

    Traditional supercomputers are designed for high-performance computing, so big data processing applications bring software and hardware challenges in computation, storage, communication and programming. This paper introduces optimization methods that enable the Tianhe-2 supercomputer system to process big data, including a new heterogeneous polymorphic architecture, the custom high-speed TH-Express 2+ interconnection network, a hybrid hierarchical storage system and a hybrid computing pattern framework. These efforts may offer guidance for designing supercomputers in the age of big data.

    Precision medicine and big data
    GUO Yike1,2, YANG Xian2
    2016, 22(1):  17-27.  doi:10.3969/j.issn.1007-2861.2015.05.015
    Abstract ( 659 )   PDF (7692KB) ( 433 )

    To achieve precision medicine, various kinds of big data need to be collected and analysed to quantify individual patients. This paper first discusses the need to use data ranging from the molecular level to the pathway level and to incorporate medical imaging data. Different preprocessing methods should be developed for different data types, while some postprocessing steps, such as classification and network analysis, can be handled by a generalized approach across data types. From the perspective of research questions, the paper then studies methods for answering five typical questions, from simple to complex: detecting associations, identifying groups, constructing classifiers, deriving connectivity and building dynamic models.

    KNN-based even sampling preprocessing algorithm for big dataset
    JI Chengheng, LEI Yongmei
    2016, 22(1):  28-35.  doi:10.3969/j.issn.1007-2861.2015.04.020
    Abstract ( 790 )   PDF (5144KB) ( 301 )

    To address the low efficiency and high storage overheads of density-based clustering algorithms, an even data sampling algorithm based on K nearest neighbors (KNN) is proposed as a preprocessing step for clustering applications. The algorithm slices the dataset and draws samples evenly: after slicing, for part of the samples it removes each sample's K nearest neighbors in descending order of density, and the remaining samples form the sample dataset. Experimental results show that, as data size grows and with accuracy guaranteed, the sampling algorithm effectively improves clustering efficiency by reducing the amount of data that must be clustered.
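The sampling idea can be sketched in a few lines. The following is a minimal, hypothetical simplification (the function name and the density proxy are assumptions, not the paper's exact algorithm): repeatedly keep the densest remaining point and drop its K nearest neighbours, so the kept sample stays spatially even.

```python
import math

def knn_sample(points, k=2, keep_fraction=0.5):
    """Toy KNN-based even sampling: keep the densest remaining point,
    then drop its k nearest neighbours, until the target size is reached."""
    remaining = list(points)
    kept = []
    target = max(1, int(len(points) * keep_fraction))
    while remaining and len(kept) < target:
        snapshot = list(remaining)

        def density(p):
            # density proxy: negated mean distance to the k nearest
            # neighbours (larger value = denser neighbourhood)
            ds = sorted(math.dist(p, q) for q in snapshot if q is not p)[:k]
            return -(sum(ds) / len(ds)) if ds else 0.0

        p = max(remaining, key=density)   # pick the densest point
        remaining.remove(p)
        kept.append(p)
        # drop p's k nearest neighbours to keep the sample even
        remaining.sort(key=lambda q: math.dist(p, q))
        del remaining[:k]
    return kept
```

A real implementation would compute densities once per slice rather than rescanning, but the keep-then-drop structure is the same.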

    A context-aware weighting approach for big data of quality ratings in E-commerce
    QI Lianyong1,2, DOU Wanchun1, ZHOU Yuming1
    2016, 22(1):  36-44.  doi:10.3969/j.issn.1007-2861.2015.04.021
    Abstract ( 680 )   PDF (5274KB) ( 283 )

    With the fast development of E-commerce, large amounts of quality rating data for commodities are generated online. By analyzing the rating data, users can evaluate commodity quality. However, because the rating data are massive and diverse, it is challenging for users to evaluate commodity quality quickly and accurately. To this end, a context-aware weighting approach (CWA) for E-commerce ratings is proposed. With CWA, a few important ratings are selected and most unimportant ones are dropped, so that commodity quality can be evaluated quickly and accurately. A series of experiments validate the effectiveness of the proposed CWA.
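As an illustration only (the field names and the context-match weight below are assumptions; the abstract does not specify CWA's actual weighting), a context-aware selection step might score each rating by how well its context matches the evaluating user's context and keep only the top few:

```python
def cwa_select(ratings, target_context, top_m=3):
    """Toy context-aware weighting: weight = number of matching context
    fields; keep the top_m ratings and estimate quality from them only."""
    def weight(r):
        ctx = r["context"]
        return sum(ctx.get(k) == v for k, v in target_context.items())

    ranked = sorted(ratings, key=weight, reverse=True)
    selected = ranked[:top_m]
    # quality estimate uses the few important ratings; the rest are dropped
    score = sum(r["rating"] for r in selected) / len(selected)
    return selected, score
```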

    Survey of clustering methods for big data in biology
    LU Dongfang, XU Junfu, XIANG Chaojuan, XIE Jiang
    2016, 22(1):  45-57.  doi:10.3969/j.issn.1007-2861.2015.04.018
    Abstract ( 1201 )   PDF (10758KB) ( 1108 )

    With the implementation of the Human Genome Project and the rapid development of biological experiment technology, biological data are growing sharply and accumulating continuously: the age of big data in biology has arrived. In the post-genomic era, single statistical models are gradually being replaced by combinations of intelligent and comprehensive analyses. Clustering is at the core of data mining. This paper describes the state of the art of big data in bioinformatics, and summarizes several popular clustering methods for gene expression profiles and biological networks. Furthermore, experiments comparing different clustering methods on time-series data of mouse embryonic fibroblasts show that different methods yield different results. To reach reliable conclusions on highly noisy biological data, investigators need to perform comprehensive analyses by selecting and combining proper clustering methods.
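One standard way to quantify the observation that different clustering methods give different results is a pairwise-agreement score such as the Rand index, which checks, for every pair of items, whether the two clusterings agree on placing them together or apart. A minimal sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings of the same
    items agree (together in both, or apart in both)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

Label values do not matter, only the grouping: `[0,0,1,1]` and `[1,1,0,0]` are the same clustering and score 1.0.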

    Product matching based on Internet and its implementation
    GU Qi1,2, ZHU Can1, CAO Jian1
    2016, 22(1):  58-68.  doi:10.3969/j.issn.1007-2861.2015.04.016
    Abstract ( 1147 )   PDF (1978KB) ( 746 )

    Entity resolution identifies records from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration and data mining, and is key to ensuring data quality. With the rapid growth of E-commerce, the diversity of products and the flexible buying patterns of consumers, product identification and matching has become a long-standing research topic in the big data era. While traditional entity resolution approaches focus on structured data, Internet data are neither standardized nor structured. To address this problem, this paper presents a synthesized similarity method for computing the similarity between different products. An agglomerative hierarchical clustering method is used to identify products from different sources. The approach is further optimized to improve execution efficiency in three aspects: global caching, knowledge constraints, and blocking strategies. Finally, a series of experiments on real data sets show that the proposed approach outperforms others.
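A toy version of similarity-driven agglomerative matching is sketched below. Jaccard similarity over title tokens and single-linkage merging are illustrative choices standing in for the paper's synthesized similarity, not its actual method:

```python
def title_sim(a, b):
    """Jaccard similarity between the token sets of two product titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def match_products(titles, threshold=0.5):
    """Agglomerative grouping: merge two clusters whenever any cross-pair
    of titles reaches the similarity threshold (single linkage)."""
    clusters = [[t] for t in titles]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(title_sim(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

The blocking strategies mentioned in the abstract would restrict which pairs are compared at all, avoiding the quadratic scan this sketch performs.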

    Multilevel hybrid parallel method for big data applications
    HUANG Lei1, ZHI Xiaoli1, ZHENG Shengan2
    2016, 22(1):  69-80.  doi:10.3969/j.issn.1007-2861.2015.04.017
    Abstract ( 711 )   PDF (10032KB) ( 281 )

    Many big data applications require a variety of parallel data processing. This paper presents a two-layer hybrid parallel method: hybrid parallelism of execution units and hybrid parallelism of computing models. By mixing execution units on the same computing node, the computing power of the infrastructure can be fully tapped, improving data processing performance. By integrating several computing models into the same execution engine in parallel, diverse heterogeneous processing modes can be applied. Different hybrid parallel configurations can match different data and computation characteristics, as well as different parallel objectives. The paper introduces the basic ideas of hybrid parallel methods and describes the main mechanisms for implementing hybrid parallelism.
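As a purely illustrative sketch of the second idea, two computing models (a batch aggregation and a per-event streaming step, both hypothetical) can share one pool of execution units on a node; Python threads stand in for the paper's execution units:

```python
from concurrent.futures import ThreadPoolExecutor

def run_hybrid(batch_jobs, stream_events):
    """Toy hybrid engine: one shared worker pool serves a batch model
    and a streaming model concurrently on the same node."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # batch model: aggregate the whole job list at once
        batch_f = pool.submit(lambda: sum(batch_jobs))
        # streaming model: one lightweight task per incoming event
        stream_f = [pool.submit(lambda e=e: e * 2) for e in stream_events]
        results["batch"] = batch_f.result()
        results["stream"] = [f.result() for f in stream_f]
    return results
```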

    Big data analysis of next generation video surveillance system for public security
    YAN Zhiguo, XU Zheng, MEI Lin, HU Chuanping
    2016, 22(1):  81-87.  doi:10.3969/j.issn.1007-2861.2015.04.015
    Abstract ( 1050 )   PDF (3242KB) ( 667 )

    Video surveillance has become an important tool due to its rich, intuitive and accurate information. However, with the large-scale construction of video surveillance systems all over the world, useful information and clues cannot be found quickly in the huge volumes of video data. This problem reduces detection efficiency in crime prediction and public security governance. A great variety of public security information systems have been built for managing traffic accidents and predicting criminal events and terrorist attacks, but large-scale redundant construction of such systems wastes IT resources and causes information overload. Technologies such as big data, cloud computing and virtualization have been applied in the public security industry to solve these problems. This paper describes a novel architecture for the next-generation public security system that uses a front-plus-back pattern. The architecture introduces cloud technologies such as distributed storage and computing, and retrieval of huge, heterogeneous data. Multiple optimized strategies are proposed to enhance resource utilization and task efficiency.

    Recognition of Chinese characters on license plates based on big data
    SHEN Wenfeng, ZHANG Jianlei, ZHOU Dingqian, CHEN Shengbo, QIU Feng
    2016, 22(1):  88-96.  doi:10.3969/j.issn.1007-2861.2015.04.019
    Abstract ( 1008 )   PDF (7518KB) ( 445 )

    Today, traffic generates huge data sets on the network, calling for the development of intelligent transportation. License plate recognition (LPR) techniques are an important basis of intelligent transportation and are widely applied in areas such as garage management and traffic monitoring. However, current LPR algorithms fall short in recognition accuracy: although they work well on English letters and digits, they are unsatisfactory on Chinese characters. This paper proposes a license plate recognition algorithm using a deep belief network (DBN) composed of restricted Boltzmann machines (RBM). It greatly improves Chinese character recognition, with an accuracy rate of up to 99.44%.
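The building block of such a DBN is the RBM's conditional hidden activation, p(h_j = 1 | v) = σ(b_j + Σ_i v_i W_ij), where σ is the logistic sigmoid. A small sketch with toy weights (not the paper's trained model):

```python
import math

def rbm_hidden_probs(v, W, b_h):
    """One RBM layer of a DBN: p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j]).
    Stacking such layers and fine-tuning gives the deep belief network."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]
```

Training would adjust W and b_h by contrastive divergence; here they are fixed toy values.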

    Predicting number of online users by ε-SVR
    GU Chundong
    2016, 22(1):  97-104.  doi:10.3969/j.issn.1007-2861.2015.05.001
    Abstract ( 765 )   PDF (6140KB) ( 406 )

    Predicting the number of online audio-visual users can provide valuable information that helps manufacturers gain more profit. Based on time series analysis, support vector regression is used to make accurate predictions with adjusted features. The time series is first modeled and predicted; a linear regression model then makes a further improvement; and a new feature is added by combining time and real-life characteristics. Samples of the new feature are trained with support vector regression, and optimal parameters of the radial basis function are sought using social cognitive optimization. Good prediction results are obtained with the proposed method.
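The core of ε-SVR is the ε-insensitive loss: prediction errors inside a ±ε tube cost nothing, while errors outside it grow linearly. A short sketch of that loss (the function name is ours; the full ε-SVR optimization also involves regularization and, here, an RBF kernel):

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Sum of ε-insensitive errors: max(0, |y - f(x)| - ε) per sample."""
    return sum(max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred))
```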