Integration and optimization of multiple big data processing platforms

Bao-Rong Chang (Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan)
Hsiu-Fen Tsai (Department of Marketing Management, Shu Te University, Kaohsiung, Taiwan)
Yun-Che Tsai (Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan)
Chin-Fu Kuo (Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan)
Chi-Chung Chen (Department of Electrical Engineering, National Chiayi University, Chiayi City, Taiwan)

Engineering Computations

ISSN: 0264-4401

Article publication date: 1 August 2016

Abstract

Purpose

The purpose of this paper is to integrate and optimize multiple big data processing platforms into a single platform that offers high performance, high availability and high scalability in a big data environment.

Design/methodology/approach

First, the integration of Apache Hive, Cloudera Impala and BDAS Shark lets the platform support SQL-like queries. Next, users access a single interface, and the proposed optimizer automatically selects the best-performing big data warehouse platform. Finally, the distributed memory storage system Memcached, incorporated into the distributed file system Apache HDFS, is employed to cache query results quickly. Therefore, if users issue the same SQL command again, the result is returned rapidly from the cache system instead of requiring a repeated, slower search of the big data warehouse.
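The caching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `QueryResultCache`, the `backend` callable and the in-process dictionary that stands in for Memcached are all assumptions made for illustration, and the cache key is simply a hash of the exact SQL text.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple]


class QueryResultCache:
    """Cache query results keyed by the SQL text (hypothetical helper).

    A plain dict stands in for Memcached here so the sketch is
    self-contained; a real deployment would use a Memcached client.
    """

    def __init__(self, backend: Callable[[str], Rows]) -> None:
        # `backend` is an assumed callable that runs the SQL on the
        # selected warehouse platform and returns the result rows.
        self._backend = backend
        self._store: Dict[str, Rows] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql: str) -> str:
        # Hash the SQL text so the key has a fixed, bounded length.
        return hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()

    def query(self, sql: str) -> Rows:
        key = self._key(sql)
        if key in self._store:
            # Same SQL command seen before: answer from the cache.
            self.hits += 1
            return self._store[key]
        # First occurrence: run the query, then remember the result.
        self.misses += 1
        rows = self._backend(sql)
        self._store[key] = rows
        return rows
```

In this sketch only byte-identical SQL commands (after trimming surrounding whitespace) share a cache entry, which matches the "same SQL command" case in the abstract; invalidation when the underlying data changes is left out.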

Findings

As a result, the proposed approach significantly improves overall performance and dramatically reduces the search time when querying a database, especially for highly repeated SQL commands in multi-user mode.

Research limitations/implications

Currently, Shark’s latest stable version, 0.9.1, does not support the latest versions of Spark and Hive. In addition, this software stack supports only Oracle JDK7. Using Oracle JDK8 or OpenJDK causes serious errors, and some of the software will not run.

Practical implications

One problem with this system is that some blocks go missing when too many blocks are stored for a single result (about 100,000 records). Another problem is that writing sequentially into the in-memory cache wastes time.

Originality/value

When the remaining memory capacity on each server is 2 GB or less, Impala and Shark incur heavy page swapping, causing extremely low performance. When the data scale is larger, this may raise a JVM I/O exception and crash the program. However, when the remaining memory capacity is sufficient, Shark is faster than Hive and Impala. Impala’s consumption of memory resources lies between those of Shark and Hive, so an intermediate amount of remaining memory is sufficient for Impala to reach its maximum performance. In this study, each server allocates 20 GB of memory for cluster computing and sets the critical points of remaining memory at Level 1: 3 percent (0.6 GB), Level 2: 15 percent (3 GB) and Level 3: 75 percent (15 GB). The program automatically selects Hive when remaining memory is less than 15 percent, Impala at 15 to 75 percent and Shark at more than 75 percent.
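The selection rule stated above can be written as a short sketch. The function names `free_fraction` and `select_platform` are hypothetical, and the code covers only the 15 percent and 75 percent thresholds from the abstract; how the remaining memory is actually measured on the cluster is not specified there and is left out.

```python
ALLOCATED_GB = 20.0  # memory each server allocates for cluster computing


def free_fraction(free_gb: float, allocated_gb: float = ALLOCATED_GB) -> float:
    """Return remaining memory as a fraction of the allocated memory."""
    if allocated_gb <= 0:
        raise ValueError("allocated_gb must be positive")
    return free_gb / allocated_gb


def select_platform(fraction: float) -> str:
    """Pick a platform from the remaining-memory fraction (0.0 to 1.0).

    Thresholds follow the abstract: Hive below 15 percent, Impala from
    15 to 75 percent, Shark above 75 percent.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    if fraction < 0.15:
        return "hive"
    if fraction <= 0.75:
        return "impala"
    return "shark"
```

For example, a server with 0.6 GB remaining out of 20 GB would be routed to Hive, one with 3 GB or 15 GB to Impala, and one with 16 GB to Shark.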

Acknowledgements

This work is supported by the Ministry of Science and Technology, Taiwan, Republic of China, under Grant No. MOST 104-2221-E-390-010.

Citation

Chang, B.-R., Tsai, H.-F., Tsai, Y.-C., Kuo, C.-F. and Chen, C.-C. (2016), "Integration and optimization of multiple big data processing platforms", Engineering Computations, Vol. 33 No. 6, pp. 1680-1704. https://doi.org/10.1108/EC-08-2015-0247

Publisher

Emerald Group Publishing Limited

Copyright © 2016, Emerald Group Publishing Limited
