Search results

1 – 10 of 131
Article
Publication date: 22 February 2024

Ranjeet Kumar Singh

Abstract

Purpose

Although the challenges associated with big data are increasing, the question of the most suitable big data analytics (BDA) platform in libraries is always significant. The purpose of this study is to propose a solution to this problem.

Design/methodology/approach

The current study identifies relevant literature and provides a review of big data adoption in libraries. It also presents a step-by-step guide for the development of a BDA platform using the Apache Hadoop Ecosystem. To test the system, an analysis of library big data was performed using Apache Pig, a tool from the Apache Hadoop Ecosystem, establishing the effectiveness of the Apache Hadoop Ecosystem as a powerful BDA solution in libraries.
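
As a concrete picture of the kind of Pig test described here, the sketch below submits a small Pig Latin job from Python. The input layout (user_id, item_id, branch), the file names and the loan-counting task are assumptions for illustration, not details taken from the study.

```python
# Hypothetical sketch: run a Pig Latin analysis over library circulation
# records. Field names and paths are illustrative assumptions.
import subprocess
import tempfile

PIG_SCRIPT = """
records   = LOAD 'circulation.csv' USING PigStorage(',')
            AS (user_id:chararray, item_id:chararray, branch:chararray);
by_branch = GROUP records BY branch;
counts    = FOREACH by_branch GENERATE group AS branch, COUNT(records) AS n;
ranked    = ORDER counts BY n DESC;
STORE ranked INTO 'branch_counts' USING PigStorage(',');
"""

with tempfile.NamedTemporaryFile("w", suffix=".pig", delete=False) as f:
    f.write(PIG_SCRIPT)
    script_path = f.name

# -x local runs Pig against the local filesystem; omit it to run the same
# script on a Hadoop cluster.
subprocess.run(["pig", "-x", "local", script_path], check=True)
```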

Findings

It can be inferred from the literature that libraries and librarians have not taken the possibility of big data services in libraries very seriously. Also, the literature suggests that no significant effort has been made to establish any BDA architecture in libraries. This study establishes the Apache Hadoop Ecosystem as a possible solution for delivering BDA services in libraries.

Research limitations/implications

The present work suggests that libraries adopt the idea of providing various big data services by developing a BDA platform: for instance, assisting researchers in understanding big data, offering cleaning and curation of big data by skilled and experienced data managers, and providing the infrastructural support to store, process, manage, analyze and visualize big data.

Practical implications

The study concludes that Apache Hadoop's Hadoop Distributed File System (HDFS) and MapReduce components significantly reduce the complexities of big data storage and processing, respectively, and that Apache Pig, using the Pig Latin scripting language, processes big data efficiently and responds to queries quickly.
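
The division of labour the study credits to Hadoop can be pictured with a minimal Hadoop Streaming job, where HDFS holds the input and the mapper and reducer are plain Python filters over stdin. The loan-count task and field layout are assumptions for illustration.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming job: count loans per item. Submit with, e.g.:
#   hadoop jar hadoop-streaming.jar -input /library/loans \
#       -output /library/loan_counts -mapper 'loan_counts.py map' \
#       -reducer 'loan_counts.py reduce' -file loan_counts.py
import sys

def mapper():
    # Assumed input layout per line: user_id,item_id,branch
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")

def reducer():
    # Streaming sorts by key, so identical keys arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value or 0)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```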

Originality/value

According to the study, significantly fewer efforts have been made to analyze big data from libraries. Furthermore, it has been discovered that acceptance of the Apache Hadoop Ecosystem as a solution to big data problems in libraries is not widely discussed in the literature, although Apache Hadoop is regarded as one of the best frameworks for big data handling.

Details

Digital Library Perspectives, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2059-5816

Article
Publication date: 26 April 2022

Jianpeng Zhang and Mingwei Lin

Abstract

Purpose

The purpose of this paper is to provide an overview of 6,618 publications on Apache Hadoop from 2008 to 2020, offering a conclusive and comprehensive analysis for researchers in this field, as well as preliminary knowledge of Apache Hadoop for interested researchers.

Design/methodology/approach

This paper employs bibliometric and visual analysis approaches to systematically study publications about Apache Hadoop in the Web of Science database, with the aid of visualization applications. Through bibliometric analysis of the collected documents, it analyzes the main statistical characteristics and cooperation networks. Research themes, research hotspots and future development trends are also investigated through keyword analysis.
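
As a rough picture of the keyword-analysis step, the sketch below counts keyword co-occurrences from a hypothetical CSV export of Web of Science records; the file name and the use of the 'DE' (author keywords) field are assumptions for the example.

```python
# Illustrative sketch of the counting behind keyword co-occurrence analysis.
# Assumes a CSV export of Web of Science records with a 'DE' (author
# keywords) column whose entries are separated by semicolons.
import csv
from collections import Counter
from itertools import combinations

pair_counts = Counter()
with open("wos_hadoop_records.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        keywords = sorted({k.strip().lower()
                           for k in row.get("DE", "").split(";") if k.strip()})
        pair_counts.update(combinations(keywords, 2))  # unordered pairs

# The heaviest pairs approximate the hotspots a co-word map would show.
for (a, b), n in pair_counts.most_common(10):
    print(f"{n:4d}  {a} -- {b}")
```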

Findings

Research on Apache Hadoop will remain a top priority in the future, and how to improve the performance of Apache Hadoop in the era of big data is one of the research hotspots.

Research limitations/implications

This paper makes a comprehensive analysis of Apache Hadoop with bibliometric methods, helping researchers quickly grasp the hot topics in this area.

Originality/value

This paper maps the structural characteristics of the publications in this field and summarizes the research hotspots and trends of recent years, aiming to capture the development status of the field and inspire new ideas for researchers.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 16 no. 1
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 6 August 2021

Alexander Döschl, Max-Emanuel Keller and Peter Mandl

Abstract

Purpose

This paper aims to evaluate different approaches for the parallelization of compute-intensive tasks. The study compares a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and resilient distributed data set (RDD) (Apache Spark) paradigms and a graphics processing unit (GPU) approach with Numba for compute unified device architecture (CUDA).

Design/methodology/approach

The paper uses a simple but computationally intensive puzzle as a case study for experiments. To find all solutions using brute force search, 15! permutations had to be computed and tested against the solution rules. The experimental application comprises a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms and a GPU approach with Numba for CUDA. The implementations were benchmarked on Amazon EC2 instances for performance and scalability measurements.
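
A PySpark sketch of that brute-force layout follows: the 15! search space is fanned out by fixing the first two positions, giving 210 independent tasks. The is_solution predicate is a placeholder, since the abstract does not state the puzzle's rules.

```python
# Sketch of distributing a 15! brute-force search with Spark RDDs. Each task
# fixes a two-element prefix and enumerates the 13! completions, so the job
# is compute-bound by design. is_solution() is a placeholder predicate.
from itertools import permutations
from pyspark import SparkContext

N = 15

def is_solution(candidate):
    return False  # stand-in for the puzzle's solution rules

def search(prefix):
    rest = [x for x in range(N) if x not in prefix]
    return [prefix + tail
            for tail in permutations(rest)
            if is_solution(prefix + tail)]

sc = SparkContext(appName="puzzle-brute-force")
prefixes = [(a, b) for a in range(N) for b in range(N) if a != b]  # 210 tasks
solutions = sc.parallelize(prefixes, len(prefixes)).flatMap(search).collect()
print(f"{len(solutions)} solutions found")
```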

Findings

The comparison of the solutions with Apache Hadoop and Apache Spark under Amazon EMR showed that the processing time measured in CPU minutes with Spark was up to 30% lower, while the performance of Spark especially benefits from an increasing number of tasks. With the CUDA implementation, more than 16 times faster execution is achievable for the same price compared to the Spark solution. Apart from the multi-threaded implementation, the processing times of all solutions scale approximately linearly. Finally, several application suggestions for the different parallelization approaches are derived from the insights of this study.

Originality/value

There are numerous studies that have examined the performance of parallelization approaches. Most of these studies deal with processing large amounts of data or mathematical problems. This work, in contrast, compares these technologies on their ability to implement computationally intensive distributed algorithms.

Details

International Journal of Web Information Systems, vol. 17 no. 4
Type: Research Article
ISSN: 1744-0084

Content available
Article
Publication date: 25 November 2013

Heidi Hanson and Zoe Stewart-Marshall

Abstract

Details

Library Hi Tech News, vol. 30 no. 10
Type: Research Article
ISSN: 0741-9058

Open Access
Article
Publication date: 3 August 2020

Maryam AlJame and Imtiaz Ahmad

Abstract

The evolution of technologies has unleashed a wealth of challenges by generating massive amounts of data. Recently, biological data have increased exponentially, introducing several computational challenges. DNA short read alignment is an important problem in bioinformatics, and the exponential growth in the number of short reads has increased the need for an ideal platform to accelerate the alignment process. Apache Spark is a cluster-computing framework that provides data parallelism and fault tolerance. In this article, we propose a Spark-based algorithm, called Spark-DNAligning, to accelerate the DNA short read alignment problem. Spark-DNAligning exploits Apache Spark's performance optimizations such as broadcast variables, join after partitioning, caching and in-memory computations. Spark-DNAligning is evaluated in terms of performance by comparing it with the SparkBWA tool and a MapReduce-based algorithm called CloudBurst. All experiments are conducted on Amazon Web Services (AWS). Results demonstrate that Spark-DNAligning outperforms both tools, providing a speedup in the range of 101–702 when aligning gigabytes of short reads to the human genome. Empirical evaluation reveals that Apache Spark offers promising solutions to the DNA short read alignment problem.
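
The named optimizations can be pictured in a few lines of PySpark: broadcast a k-mer index of the reference so each executor receives one read-only copy, and cache the reads RDD because it is reused. The exact-match seed lookup below is a deliberate simplification for illustration, not Spark-DNAligning's actual algorithm.

```python
# Simplified sketch of broadcast + caching for short-read seed matching.
# File formats, the seed length K and the exact-match lookup are assumptions.
from pyspark import SparkContext

K = 21  # seed length (assumption)

sc = SparkContext(appName="seed-match-sketch")

reference = "".join(line.strip() for line in open("reference.fa")
                    if not line.startswith(">"))
index = {}
for i in range(len(reference) - K + 1):
    index.setdefault(reference[i:i + K], []).append(i)
ref_index = sc.broadcast(index)  # shipped once per executor, not per task

reads = sc.textFile("reads.txt").map(str.strip).cache()  # reused twice below

def candidates(read):
    # Reference positions where the read's first K bases match exactly.
    return [(read, pos) for pos in ref_index.value.get(read[:K], [])]

hits = reads.flatMap(candidates)
print(f"{reads.count()} reads, {hits.count()} candidate placements")
```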

Details

Applied Computing and Informatics, vol. 19 no. 1/2
Type: Research Article
ISSN: 2634-1964

Article
Publication date: 19 November 2018

I-Cheng Chen and I-Ching Hsu

Abstract

Purpose

In recent years, governments around the world have been actively promoting Open Government Data (OGD) to facilitate the reuse of open data and the development of information applications. Currently, more than 35,000 data sets are available on the Taiwan OGD website. However, the existing Taiwan OGD website only provides keyword queries and lacks a friendly query interface. This study aims to address these issues by defining a DBpedia cloud computing framework (DCCF) for integrating DBpedia with Semantic Web technologies into a Spark cluster cloud computing environment.

Design/methodology/approach

The proposed DCCF is used to develop a Taiwan OGD recommendation platform (TOGDRP) that provides a friendly query interface to automatically filter out the relevant data sets and visualize relationships between these data sets.
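
One plausible building block of such a platform is a DBpedia lookup that expands a user's keyword into related labels before filtering data sets. The sketch below queries the public SPARQL endpoint; the query shape is an illustrative assumption, not DCCF's actual pipeline.

```python
# Hypothetical sketch: expand a keyword via DBpedia's public SPARQL endpoint.
import requests

def related_labels(keyword, lang="en", limit=20):
    query = f"""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?label WHERE {{
      ?s rdfs:label "{keyword}"@{lang} .
      ?s dbo:wikiPageWikiLink ?o .
      ?o rdfs:label ?label .
      FILTER (lang(?label) = "{lang}")
    }} LIMIT {limit}
    """
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [b["label"]["value"] for b in resp.json()["results"]["bindings"]]

print(related_labels("Air pollution"))
```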

Findings

To demonstrate the feasibility of TOGDRP, the experimental results illustrate the efficiency of the different cloud computing models, including the Hadoop YARN cluster model, the Spark standalone cluster model and the Spark YARN cluster model.

Originality/value

The novel solution proposed in this study is a hybrid approach that integrates Semantic Web technologies into a Hadoop and Spark cloud computing environment to provide OGD data set recommendations.

Details

International Journal of Web Information Systems, vol. 15 no. 2
Type: Research Article
ISSN: 1744-0084

Book part
Publication date: 19 July 2022

Ayesha Banu

Abstract

Introduction: The Internet has tremendously transformed the computer and networking world. Information reaches our fingertips and adds data to our repository within a second. Big data was initially defined by three Vs: data arriving with greater variety, in increasing volumes and with extra velocity. Big data is a collection of structured, unstructured and semi-structured data gathered from different sources and applications, and it has become the most powerful buzzword in almost all business sectors. The real success of any industry can be measured by how its big data is analysed, potential knowledge discovered and productive business decisions made. New technologies such as artificial intelligence and machine learning have added more efficiency to storing and analysing data. Big data analytics (BDA) is most valuable to companies focused on gaining insight into customer behaviour, trends and patterns. This popularity of big data has inspired insurance companies to utilise big data in their core systems to advance financial operations, improve customer service, construct a personalised environment and take all possible measures to increase revenue and profits.

Purpose: This study aims to recognise what big data stands for in the insurance sector and how the application of BDA has opened the door for new and innovative changes in the insurance industry.

Methodology: This study describes the field of BDA in the insurance sector; discusses its benefits; outlines tools, an architectural framework and the method; describes applications both in general and in specific terms; and briefly discusses the opportunities and challenges.

Findings: The study concludes that BDA in insurance is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs. Its potential is great; however, there remain challenges to overcome.

Details

Big Data: A Game Changer for Insurance Industry
Type: Book
ISBN: 978-1-80262-606-3

Article
Publication date: 18 October 2021

Sujan Saha and Sukumar Mandal

Abstract

Purpose

These projects aim to improve library services for users in the future by combining Linked Open Data (LOD) technology with data visualization. The system displays and analyses search results in an intuitive manner. These services are enhanced by integrating various LOD technologies into the authority control system.

Design/methodology/approach

The technology known as LOD is used to access, recycle, share, exchange and disseminate information, among other things. The applicability of Linked Data technologies for the development of library information services is evaluated in this study.

Findings

Apache Hadoop is used for rapidly storing and processing massive Linked Data data sets. Apache Spark is a free and open-source data processing tool. Hive is a SQL-based data warehouse that enables data scientists to write, read and manage petabytes of data.
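
How those pieces fit together can be sketched with a Spark session that has Hive support enabled, querying Linked Data landed on HDFS as N-Triples. The paths and the naive triple parsing below are assumptions for the example.

```python
# Illustrative sketch: Spark (with Hive support) querying N-Triples on HDFS.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lod-authority-sketch")
         .enableHiveSupport()
         .getOrCreate())

def parse(line):
    # N-Triples: <subject> <predicate> <object> .  (naive split; literals
    # containing spaces would need a real parser)
    s, p, o = line.rstrip(" .\n").split(" ", 2)
    return (s, p, o)

triples = spark.sparkContext.textFile("hdfs:///lod/authority.nt").map(parse)
spark.createDataFrame(triples, ["s", "p", "o"]).createOrReplaceTempView("triples")

# SQL over the triples: every label attached to an authority record
spark.sql("""
    SELECT s, o FROM triples
    WHERE p = '<http://www.w3.org/2000/01/rdf-schema#label>'
""").show(10, truncate=False)
```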

Originality/value

Apache HBase, a distributed big data storage system, does not use SQL. This study's goal is to search the geographic, authority and bibliographic databases for relevant links found on various websites. When data items are linked together, all of their constituent data elements are linked as well. The study observed and evaluated the tools and processes and recorded each data item's URL. As a result, data can be combined across silos, enhanced by third-party data sources and contextualized.

Details

Library Hi Tech News, vol. 38 no. 6
Type: Research Article
ISSN: 0741-9058

Article
Publication date: 1 August 2016

Bao-Rong Chang, Hsiu-Fen Tsai, Yun-Che Tsai, Chin-Fu Kuo and Chi-Chung Chen

Abstract

Purpose

The purpose of this paper is to integrate and optimize a multiple big data processing platform with the features of high performance, high availability and high scalability in a big data environment.

Design/methodology/approach

First, the integration of Apache Hive, Cloudera Impala and BDAS Shark makes the platform support SQL-like queries. Next, users access a single interface, and the proposed optimizer automatically selects the best-performing big data warehouse platform. Finally, the distributed memory storage system Memcached, incorporated into the Apache HDFS distributed file system, is employed for fast caching of query results. Therefore, if users issue the same SQL command, the result is returned rapidly from the cache system instead of repeating the search in the big data warehouse and taking longer to retrieve.
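
The caching idea reduces to a few lines: key Memcached on a hash of the SQL text, so a repeated query is answered from the cache instead of re-running in the warehouse. In this sketch, run_on_warehouse() is a placeholder for dispatch to Hive, Impala or Shark.

```python
# Minimal sketch of caching warehouse query results in Memcached.
import hashlib
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def run_on_warehouse(sql):
    raise NotImplementedError  # placeholder: submit to Hive/Impala/Shark

def query(sql, ttl=3600):
    key = "q:" + hashlib.sha256(sql.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # repeated query: served from the cache
    rows = run_on_warehouse(sql)
    cache.set(key, json.dumps(rows), expire=ttl)
    return rows
```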

Findings

As a result, the proposed approach significantly improves overall performance and dramatically reduces search time when querying a database, especially for highly repeatable SQL commands under multi-user mode.

Research limitations/implications

Currently, Shark's latest stable version 0.9.1 does not support the latest versions of Spark and Hive. In addition, this software stack only supports Oracle JDK7; using Oracle JDK8 or OpenJDK causes serious errors, and some of the software will be unable to run.

Practical implications

The problem with this system is that some blocks go missing when too many blocks are stored in one result (about 100,000 records). Another problem is that sequential writing into the in-memory cache wastes time.

Originality/value

When the remaining memory capacity is 2 GB or less on each server, Impala and Shark incur heavy page swapping, causing extremely low performance. When the data scale is larger, this may cause a JVM I/O exception and crash the program. However, when the remaining memory capacity is sufficient, Shark is faster than Hive and Impala. Impala's consumption of memory resources lies between those of Shark and Hive, and a sufficient amount of remaining memory yields Impala's maximum performance. In this study, each server allocates 20 GB of memory for cluster computing and sets the remaining memory at three critical points: Level 1: 3 percent (0.6 GB), Level 2: 15 percent (3 GB) and Level 3: 75 percent (15 GB). The program automatically selects Hive when remaining memory is less than 15 percent, Impala at 15 to 75 percent and Shark at more than 75 percent.
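
The stated selection rule amounts to a three-way threshold on the fraction of remaining memory; the sketch below transcribes only that rule (the 3 percent level is listed as a critical point but is not part of the stated Hive/Impala/Shark choice).

```python
# Engine selection by remaining-memory fraction, per the thresholds above
# (20 GB allocated per server; 15% = 3 GB, 75% = 15 GB).
def pick_engine(free_fraction):
    if free_fraction < 0.15:
        return "Hive"      # below 15% free memory
    if free_fraction <= 0.75:
        return "Impala"    # 15-75% free memory
    return "Shark"         # above 75% free memory

assert pick_engine(0.03) == "Hive"
assert pick_engine(0.40) == "Impala"
assert pick_engine(0.90) == "Shark"
```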

Article
Publication date: 23 March 2023

Mohd Naz’ri Mahrin, Anusuyah Subbarao, Suriayati Chuprat and Nur Azaliah Abu Bakar

Abstract

Purpose

Cloud computing promises dependable services offered through next-generation data centres built on virtualization technologies for computation, networking and storage. Big data applications have been made viable by cloud computing technologies due to the tremendous expansion of data. Disaster management is one of the areas where big data applications are rapidly being deployed. This study looks at how big data is being used in conjunction with cloud computing to strengthen disaster risk reduction (DRR). This paper aims to explore and review existing frameworks for big data in disaster management and to provide an insightful view of how cloud-based big data platforms are applied toward DRR.

Design/methodology/approach

A systematic mapping study is conducted to answer four research questions using papers related to big data analytics, cloud computing and disaster management published from 2013 to 2019. A total of 26 papers were selected after the five steps of systematic mapping.

Findings

Findings are presented for each research question.

Research limitations/implications

Specific studies of big data platforms applied to disaster management remain limited, leaving the field open for further research.

Practical implications

In terms of technology, DRR research that leverages existing big data platforms is still lacking. In terms of data, many disaster data sets are available, but scientists still struggle to learn from and listen to the data and to take more proactive disaster-preparedness measures.

Originality/value

This study shows that the platform most often selected by researchers is CPU-based processing, namely Apache Hadoop. Apache Spark, which uses in-memory processing, requires a large memory capacity and is therefore less preferred in research.

Details

Journal of Science and Technology Policy Management, vol. 14 no. 6
Type: Research Article
ISSN: 2053-4620
