
Table of Contents

    29 February 2016, Volume 22 Issue 1
    Let's go big data
    GUO Yike
    2016, 22(1):  1-2.  doi:10.3969/j.issn.1007-2861.2015.05.016
    Abstract ( 627 )   PDF (1502KB) ( 396 )
    On the challenge for supercomputer design in the big data era
    LIAO Xiangke, TAN Yusong, LU Yutong, XIE Min, ZHOU Enqiang, HUANG Jie
    2016, 22(1):  3-16.  doi:10.3969/j.issn.1007-2861.2015.03.014
    Abstract ( 810 )   PDF (7213KB) ( 604 )

    Traditional supercomputers are designed for high-performance computing, so big data processing applications bring software and hardware challenges in computation, storage, communication and programming. This paper introduces optimization methods that enable the Tianhe-2 supercomputer system to process big data, including a new heterogeneous polymorphic architecture, the custom high-speed TH-Express 2+ interconnection network, a hybrid hierarchical storage system and a hybrid computing pattern framework. These efforts may offer guidance for designing supercomputers in the age of big data.

    Precision medicine and big data
    GUO Yike1,2, YANG Xian2
    2016, 22(1):  17-27.  doi:10.3969/j.issn.1007-2861.2015.05.015
    Abstract ( 659 )   PDF (7692KB) ( 433 )

    To achieve precision medicine, various kinds of big data need to be collected and analysed to quantify individual patients. This paper first discusses the need to use data ranging from the molecular level to the pathway level and to incorporate medical imaging data. Different preprocessing methods should be developed for different data types, while some postprocessing steps, such as classification and network analysis, can be handled by a generalized approach across data types. From the perspective of research questions, the paper then studies methods for answering five typical questions, from simple to complex: detecting associations, identifying groups, constructing classifiers, deriving connectivity and building dynamic models.

    KNN-based even sampling preprocessing algorithm for big dataset
    JI Chengheng, LEI Yongmei
    2016, 22(1):  28-35.  doi:10.3969/j.issn.1007-2861.2015.04.020
    Abstract ( 790 )   PDF (5144KB) ( 301 )

    To address the low efficiency and high storage overheads of density-based clustering algorithms, an even data sampling algorithm based on K nearest neighbors (KNN) is proposed as a preprocessing step for clustering applications. The algorithm slices the dataset and draws samples evenly: after slicing, for part of the samples it removes each sample's K nearest neighbors in descending order of density, and the remaining samples form the sample dataset. Experimental results show that, as data size grows and with accuracy guaranteed, the sampling algorithm effectively improves clustering efficiency by reducing the amount of data that must be clustered.
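The sampling idea can be sketched in a few lines. The following is a minimal, hypothetical simplification (the function name and the density proxy are assumptions, not the paper's exact algorithm): repeatedly keep the densest remaining point and drop its K nearest neighbours, so the kept sample stays spatially even.

```python
import math

def knn_sample(points, k=2, keep_fraction=0.5):
    """Toy KNN-based even sampling: keep the densest remaining point,
    then drop its k nearest neighbours, until the target size is reached."""
    remaining = list(points)
    kept = []
    target = max(1, int(len(points) * keep_fraction))
    while remaining and len(kept) < target:
        snapshot = list(remaining)

        def density(p):
            # density proxy: negated mean distance to the k nearest
            # neighbours (larger value = denser neighbourhood)
            ds = sorted(math.dist(p, q) for q in snapshot if q is not p)[:k]
            return -(sum(ds) / len(ds)) if ds else 0.0

        p = max(remaining, key=density)   # pick the densest point
        remaining.remove(p)
        kept.append(p)
        # drop p's k nearest neighbours to keep the sample even
        remaining.sort(key=lambda q: math.dist(p, q))
        del remaining[:k]
    return kept
```

A real implementation would compute densities once per slice rather than rescanning, but the keep-then-drop structure is the same.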

    A context-aware weighting approach for big data of quality ratings in E-commerce
    QI Lianyong1,2, DOU Wanchun1, ZHOU Yuming1
    2016, 22(1):  36-44.  doi:10.3969/j.issn.1007-2861.2015.04.021
    Abstract ( 680 )   PDF (5274KB) ( 283 )

    With the fast development of E-commerce, large amounts of quality rating data for commodities are generated online. By analyzing the rating data, users can evaluate commodity quality. However, because the rating data are massive and diverse, it is challenging for users to evaluate commodity quality quickly and accurately. To this end, a context-aware weighting approach (CWA) for E-commerce ratings is proposed. With CWA, a few important ratings are selected and most unimportant ones are dropped, so that commodity quality can be evaluated quickly and accurately. A series of experiments validate the effectiveness of the proposed CWA.
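As an illustration only (the field names and the context-match weight below are assumptions; the abstract does not specify CWA's actual weighting), a context-aware selection step might score each rating by how well its context matches the evaluating user's context and keep only the top few:

```python
def cwa_select(ratings, target_context, top_m=3):
    """Toy context-aware weighting: weight = number of matching context
    fields; keep the top_m ratings and estimate quality from them only."""
    def weight(r):
        ctx = r["context"]
        return sum(ctx.get(k) == v for k, v in target_context.items())

    ranked = sorted(ratings, key=weight, reverse=True)
    selected = ranked[:top_m]
    # quality estimate uses the few important ratings; the rest are dropped
    score = sum(r["rating"] for r in selected) / len(selected)
    return selected, score
```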

    Survey of clustering methods for big data in biology
    LU Dongfang, XU Junfu, XIANG Chaojuan, XIE Jiang
    2016, 22(1):  45-57.  doi:10.3969/j.issn.1007-2861.2015.04.018
    Abstract ( 1201 )   PDF (10758KB) ( 1108 )

    With the implementation of the Human Genome Project and the rapid development of biological experiment technology, biological data are growing sharply and accumulating continuously: the age of big data in biology has arrived. In the post-genomic era, single statistical models are gradually being replaced by combinations of intelligent and comprehensive analyses. Clustering is at the core of data mining. This paper describes the state of the art of big data in bioinformatics, and summarizes several popular clustering methods for gene expression profiles and biological networks. Furthermore, experiments comparing different clustering methods on time-series data of mouse embryonic fibroblasts show that different methods yield different results. To reach reliable conclusions on highly noisy biological data, investigators need to perform comprehensive analyses by selecting and combining proper clustering methods.
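One standard way to quantify the observation that different clustering methods give different results is a pairwise-agreement score such as the Rand index, which checks, for every pair of items, whether the two clusterings agree on placing them together or apart. A minimal sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings of the same
    items agree (together in both, or apart in both)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

Label values do not matter, only the grouping: `[0,0,1,1]` and `[1,1,0,0]` are the same clustering and score 1.0.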

    Product matching based on Internet and its implementation
    GU Qi1,2, ZHU Can1, CAO Jian1
    2016, 22(1):  58-68.  doi:10.3969/j.issn.1007-2861.2015.04.016
    Abstract ( 1147 )   PDF (1978KB) ( 746 )

    Entity resolution identifies records from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration and data mining, and is key to ensuring data quality. With the rapid growth of E-commerce, the diversity of products and the flexible buying patterns of consumers, product identification and matching has become a long-standing research topic in the big data era. While traditional entity resolution approaches focus on structured data, Internet data are neither standardized nor structured. To address this problem, this paper presents a synthesized similarity method for computing the similarity between different products. An agglomerative hierarchical clustering method is used to identify products from different sources. The approach is further optimized to improve execution efficiency in three aspects: global caching, knowledge constraints, and blocking strategies. Finally, a series of experiments on real data sets show that the proposed approach outperforms others.
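A toy version of similarity-driven agglomerative matching is sketched below. Jaccard similarity over title tokens and single-linkage merging are illustrative choices standing in for the paper's synthesized similarity, not its actual method:

```python
def title_sim(a, b):
    """Jaccard similarity between the token sets of two product titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def match_products(titles, threshold=0.5):
    """Agglomerative grouping: merge two clusters whenever any cross-pair
    of titles reaches the similarity threshold (single linkage)."""
    clusters = [[t] for t in titles]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(title_sim(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

The blocking strategies mentioned in the abstract would restrict which pairs are compared at all, avoiding the quadratic scan this sketch performs.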

    Multilevel hybrid parallel method for big data applications
    HUANG Lei1, ZHI Xiaoli1, ZHENG Shengan2
    2016, 22(1):  69-80.  doi:10.3969/j.issn.1007-2861.2015.04.017
    Abstract ( 711 )   PDF (10032KB) ( 281 )

    Many big data applications require a variety of parallel data processing. This paper presents a two-layer hybrid parallel method: hybrid parallelism of execution units and hybrid parallelism of computing models. By mixing execution units on the same computing node, the computing power of the infrastructure can be fully tapped, improving data processing performance. By integrating several computing models into the same execution engine in parallel, diverse heterogeneous processing modes can be applied. Different hybrid parallel configurations can match different data and computation characteristics, as well as different parallel objectives. The paper introduces the basic ideas of hybrid parallel methods and describes the main mechanisms for implementing hybrid parallelism.
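As a purely illustrative sketch of the second idea, two computing models (a batch aggregation and a per-event streaming step, both hypothetical) can share one pool of execution units on a node; Python threads stand in for the paper's execution units:

```python
from concurrent.futures import ThreadPoolExecutor

def run_hybrid(batch_jobs, stream_events):
    """Toy hybrid engine: one shared worker pool serves a batch model
    and a streaming model concurrently on the same node."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # batch model: aggregate the whole job list at once
        batch_f = pool.submit(lambda: sum(batch_jobs))
        # streaming model: one lightweight task per incoming event
        stream_f = [pool.submit(lambda e=e: e * 2) for e in stream_events]
        results["batch"] = batch_f.result()
        results["stream"] = [f.result() for f in stream_f]
    return results
```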

    Big data analysis of next generation video surveillance system for public security
    YAN Zhiguo, XU Zheng, MEI Lin, HU Chuanping
    2016, 22(1):  81-87.  doi:10.3969/j.issn.1007-2861.2015.04.015
    Abstract ( 1050 )   PDF (3242KB) ( 667 )

    Video surveillance has become an important tool due to its rich, intuitive and accurate information. However, with the large-scale construction of video surveillance systems all over the world, useful information and clues cannot be found quickly in the huge volumes of video data. This problem reduces detection efficiency in crime prediction and public security governance. A great variety of public security information systems have been built for managing traffic accidents and predicting criminal events and terrorist attacks, but large-scale redundant construction of such systems wastes IT resources and causes information overload. Technologies such as big data, cloud computing and virtualization have been applied in the public security industry to solve these problems. This paper describes a novel architecture for the next-generation public security system that uses a front-plus-back pattern. The architecture introduces cloud technologies such as distributed storage and computing, and retrieval of huge, heterogeneous data. Multiple optimized strategies are proposed to enhance resource utilization and task efficiency.

    Recognition of Chinese characters on license plates based on big data
    SHEN Wenfeng, ZHANG Jianlei, ZHOU Dingqian, CHEN Shengbo, QIU Feng
    2016, 22(1):  88-96.  doi:10.3969/j.issn.1007-2861.2015.04.019
    Abstract ( 1008 )   PDF (7518KB) ( 445 )

    Today, traffic generates huge data sets on the network, calling for the development of intelligent transportation. License plate recognition (LPR) techniques are an important basis of intelligent transportation and are widely applied in areas such as garage management and traffic monitoring. However, current LPR algorithms fall short in recognition accuracy: although they work well on English letters and digits, they are unsatisfactory on Chinese characters. This paper proposes a license plate recognition algorithm using a deep belief network (DBN) composed of restricted Boltzmann machines (RBM). It greatly improves Chinese character recognition, with an accuracy rate of up to 99.44%.
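The building block of such a DBN is the RBM's conditional hidden activation, p(h_j = 1 | v) = σ(b_j + Σ_i v_i W_ij), where σ is the logistic sigmoid. A small sketch with toy weights (not the paper's trained model):

```python
import math

def rbm_hidden_probs(v, W, b_h):
    """One RBM layer of a DBN: p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j]).
    Stacking such layers and fine-tuning gives the deep belief network."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]
```

Training would adjust W and b_h by contrastive divergence; here they are fixed toy values.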

    Predicting number of online users by ε-SVR
    GU Chundong
    2016, 22(1):  97-104.  doi:10.3969/j.issn.1007-2861.2015.05.001
    Abstract ( 765 )   PDF (6140KB) ( 406 )

    Predicting the number of online audio-visual users can provide valuable information that helps manufacturers gain more profit. Based on time series analysis, support vector regression is used to make accurate predictions with adjusted features. The time series is first modeled and predicted; a linear regression model then makes a further improvement; and a new feature is added by combining time and real-life characteristics. Samples of the new feature are trained with support vector regression, and optimal parameters of the radial basis function are sought using social cognitive optimization. Good prediction results are obtained with the proposed method.
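The core of ε-SVR is the ε-insensitive loss: prediction errors inside a ±ε tube cost nothing, while errors outside it grow linearly. A short sketch of that loss (the function name is ours; the full ε-SVR optimization also involves regularization and, here, an RBF kernel):

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Sum of ε-insensitive errors: max(0, |y - f(x)| - ε) per sample."""
    return sum(max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred))
```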