Search results

1 – 10 of 204
Article
Publication date: 28 July 2020

Sathyaraj R, Ramanathan L, Lavanya K, Balasubramanian V and Saira Banu J

Big data is growing so rapidly that conventional software tools face several problems in managing it. Moreover, the occurrence of…

Abstract

Purpose

Big data is growing so rapidly that conventional software tools face several problems in managing it. Moreover, the occurrence of imbalanced data in massive data sets is a major constraint for the research community.

Design/methodology/approach

The purpose of the paper is to introduce a big data classification technique using the MapReduce framework based on an optimization algorithm. The classification is enabled through the MapReduce framework, which employs the proposed optimization algorithm, named the chicken-based bacterial foraging (CBF) algorithm. The proposed algorithm is generated by integrating the bacterial foraging optimization (BFO) algorithm with the cat swarm optimization (CSO) algorithm. The proposed model executes the process in two stages, namely, training and testing phases. In the training phase, the big data produced from different distributed sources is processed in parallel by the mappers, which perform preprocessing and feature selection based on the proposed CBF algorithm. The preprocessing step eliminates redundant and inconsistent data, whereas the feature selection step extracts the significant features from the preprocessed data to provide improved classification accuracy. The selected features are fed into the reducer for data classification using the deep belief network (DBN) classifier, which is trained with the proposed CBF algorithm so that the data are classified into various classes; at the end of the training process, the individual reducers output the trained models. The incremental data are thus handled effectively based on the models obtained in the training phase. In the testing phase, the incremental data are split into different subsets and fed into the different mappers for classification. Each mapper contains a trained model obtained from the training phase, which is used to classify the incremental data. After classification, the outputs obtained from the mappers are fused and fed into the reducer to produce the final classification.
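
To make the data flow above concrete, the sketch below shows the general shape of such a training phase as a plain map/reduce pair in Python. It is only an illustration under assumed record fields (features, label, partition): the fixed feature indices and the majority-class "model" are placeholders standing in for the CBF-driven feature selection and the CBF-trained DBN, which are not reproduced here.

from collections import Counter, defaultdict

SELECTED_INDICES = [0, 2]  # hypothetical stand-in for CBF-selected feature indices

def map_train(record):
    """Mapper: preprocessing plus feature selection for one record."""
    if any(v is None for v in record["features"]):
        return []  # drop inconsistent records (preprocessing step)
    selected = tuple(record["features"][i] for i in SELECTED_INDICES)
    return [(record["partition"], (selected, record["label"]))]

def reduce_train(values):
    """Reducer: 'train' one model per partition; a majority-class rule
    stands in for the CBF-trained DBN classifier."""
    majority = Counter(label for _, label in values).most_common(1)[0][0]
    return {"predict": lambda feats, m=majority: m}

def run_training(records):
    grouped = defaultdict(list)
    for rec in records:
        for key, value in map_train(rec):
            grouped[key].append(value)
    return {key: reduce_train(vals) for key, vals in grouped.items()}

models = run_training([
    {"partition": 0, "features": [1.0, 5.0, 0.2], "label": "A"},
    {"partition": 0, "features": [0.9, None, 0.1], "label": "B"},
    {"partition": 1, "features": [0.1, 4.0, 0.9], "label": "B"},
])
print(models[0]["predict"]((1.0, 0.2)))  # -> 'A'

In the testing phase described above, each mapper would hold one of these trained models and the reducer would fuse the per-mapper outputs.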

Findings

The maximum accuracy and Jaccard coefficient are obtained using the epileptic seizure recognition database. The proposed CBF-DBN produces a maximal accuracy of 91.129%, whereas the accuracy values of the existing neural network (NN), DBN and naive Bayes classifier-term frequency–inverse document frequency (NBC-TFIDF) are 82.894%, 86.184% and 86.512%, respectively. The proposed CBF-DBN produces a maximal Jaccard coefficient of 88.928%, whereas the Jaccard coefficient values of the existing NN, DBN and NBC-TFIDF are 75.891%, 79.850% and 81.103%, respectively.

Originality/value

In this paper, a big data classification method is proposed for categorizing massive data sets while meeting the constraints of huge data volumes. The classification is performed on the MapReduce framework in training and testing phases so that the data are handled in parallel. In the training phase, the big data is obtained and partitioned into different subsets that are fed into the mappers. In the mapper, the feature extraction step extracts the significant features. The obtained features are passed to the reducers, which classify the data using these features. The DBN classifier is used for the classification, with the DBN trained using the proposed CBF algorithm; the trained model is obtained as an output of this phase. In the testing phase, the incremental data are considered for classification. New data are first split into subsets and fed into the mappers for classification, using the trained models obtained from the training phase. The classified results from each mapper are fused and fed into the reducer to produce the final classification of the big data.

Details

Data Technologies and Applications, vol. 55 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 1 June 2012

James Powell, Linn Collins, Ariane Eberhardt, David Izraelevitz, Jorge Roman, Thomas Dufresne, Mark Scott, Miriam Blake and Gary Grider

The purpose of this paper is to describe a process for extracting and matching author names from large collections of bibliographic metadata using the Hadoop implementation of…

Abstract

Purpose

The purpose of this paper is to describe a process for extracting and matching author names from large collections of bibliographic metadata using the Hadoop implementation of MapReduce. It considers the challenges and risks associated with name matching at such a large scale and proposes simple matching heuristics for the reduce process. The resulting semantic graphs of authors link names to publications and include additional features such as phonetic representations of author last names. The authors believe that this achieves an appropriate level of matching at scale and enables further matching to be performed with graph analysis tools.

Design/methodology/approach

A topically focused collection of metadata records describing peer-reviewed papers was generated based upon a search. The matching records were harvested and stored in the Hadoop Distributed File System (HDFS) for processing by Hadoop. A MapReduce job was written to perform coarse-grained author name matching, and multiple papers were matched with authors when the names were very similar or identical. Semantic graphs were generated so that they could be analyzed to perform finer-grained matching, for example by using other metadata such as subject headings.
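
A rough sketch of this kind of coarse-grained matching is shown below: the mapper keys each record by a normalized surname and the reducer merges only identical or near-identical full names under that key. The record fields, the SequenceMatcher similarity and the 0.9 threshold are illustrative assumptions, not the heuristics used in the paper.

from collections import defaultdict
from difflib import SequenceMatcher

def map_author(record):
    """Mapper: emit (normalized surname, (full name, paper id))."""
    surname = record["name"].split(",")[0].strip().lower()
    return [(surname, (record["name"], record["paper_id"]))]

def reduce_authors(names_and_papers, threshold=0.9):
    """Reducer: conservatively merge names under one surname key; only
    identical or near-identical full names share an author node."""
    authors = []  # list of (canonical name, [paper ids])
    for name, paper in names_and_papers:
        for entry in authors:
            if SequenceMatcher(None, name, entry[0]).ratio() >= threshold:
                entry[1].append(paper)
                break
        else:
            authors.append((name, [paper]))
    return authors

records = [
    {"name": "Powell, James", "paper_id": "doc1"},
    {"name": "Powell, J.",    "paper_id": "doc2"},
    {"name": "Powell, Jane",  "paper_id": "doc3"},
]
grouped = defaultdict(list)
for rec in records:
    for key, value in map_author(rec):
        grouped[key].append(value)
for key, vals in grouped.items():
    print(key, reduce_authors(vals))

With a strict threshold the three "Powell" variants stay separate, which mirrors the paper's preference for reliable rules over aggressive merging.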

Findings

When performing author name matching at scale using MapReduce, the heuristics that determine whether names match should be limited to the rules that yield the most reliable results. Bad rules produce many errors at scale. MapReduce can also be used to generate or extract other data that might help resolve similar names when stricter rules fail to do so. The authors also found that matching is more reliable within a well-defined topic domain.

Originality/value

Libraries face some of the same big data challenges as are found in data-driven science. Big data tools such as Hadoop can be used to explore large metadata collections, and these collections can be used as surrogates for other real-world big data problems. MapReduce activities need to be appropriately scoped to yield good results, while keeping an eye out for problems in code that can be magnified in the output of a MapReduce job.

Details

Library Hi Tech News, vol. 29 no. 4
Type: Research Article
ISSN: 0741-9058

Open Access
Article
Publication date: 4 August 2020

Kanak Meena, Devendra K. Tayal, Oscar Castillo and Amita Jain

The scalability of similarity joins is threatened by the unexpected data characteristic of data skewness. This is a pervasive problem in scientific data. Due to skewness, the…

Abstract

The scalability of similarity joins is threatened by the unexpected data characteristic of data skewness, a pervasive problem in scientific data. Skewness leads to an uneven distribution of attributes, which can cause a severe load imbalance problem, and when database join operations are applied to such datasets the skew grows exponentially. All the algorithms developed to date for implementing database joins are highly skew sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. No prior work handles data skewness in character-based string similarity joins, although work exists for set-based string similarity joins. The proposed work is divided into three stages, and every stage is further divided into mapper and reducer phases dedicated to a specific task. The first stage finds the lengths of the strings in the dataset. For valid candidate pair generation, the MR-Pass Join framework is suggested in the second stage. In the third stage, which is further divided into four MapReduce phases, MRFA concepts are incorporated for the string similarity join, named "MRFA-SSJ" (MapReduce Frequency Adaptive – String Similarity Join). Hence, MRFA-SSJ is proposed to handle skewness in the string similarity join. The experiments were conducted on three datasets, namely DBLP, Query log and a real dataset of IP addresses and Cookies, deployed on the Hadoop framework. The proposed algorithm was compared with three known algorithms, and all of these algorithms fail when data is highly skewed, whereas the proposed method handles highly skewed data without any problem. A 15-node cluster was used in the experiments, and the Zipf distribution law was followed for the analysis of the skewness factor. A comparison among existing and proposed techniques is also shown: existing techniques survived up to a Zipf factor of 0.5, whereas the proposed algorithm survives up to a Zipf factor of 1. Hence the proposed algorithm is skew insensitive and ensures scalability with a reasonable query processing time for string similarity database joins. It also ensures an even distribution of attributes.
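
As an illustration of the first stage and of length-based candidate filtering for a character-based (edit-distance) join, the sketch below keys strings by length in a mapper and pairs only strings whose lengths differ by at most the edit-distance bound. It is not the MR-Pass Join or MRFA-SSJ algorithm, and the toy data are assumptions.

from collections import defaultdict
from itertools import combinations

def map_length(string_id, s):
    """Mapper: key each string by its length."""
    return [(len(s), (string_id, s))]

def reduce_candidates(groups, max_edit_distance=1):
    """Reducer side: two strings can be within edit distance d only if their
    lengths differ by at most d, so only nearby length groups are paired."""
    candidates = []
    lengths = sorted(groups)
    for i, l in enumerate(lengths):
        candidates.extend(combinations(groups[l], 2))  # same-length pairs
        for m in lengths[i + 1:]:                      # strictly longer groups
            if m - l > max_edit_distance:
                break
            candidates.extend((a, b) for a in groups[l] for b in groups[m])
    return candidates

strings = {"s1": "hadoop", "s2": "hadop", "s3": "mapreduce"}
groups = defaultdict(list)
for sid, s in strings.items():
    for key, value in map_length(sid, s):
        groups[key].append(value)
print(reduce_candidates(groups))
# -> [(('s2', 'hadop'), ('s1', 'hadoop'))]

Skew shows up here when one length group is vastly larger than the others; handling that imbalance is what the MRFA-based third stage addresses.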

Details

Applied Computing and Informatics, vol. 18 no. 1/2
Type: Research Article
ISSN: 2634-1964

Article
Publication date: 2 November 2015

Hengliang Shi, Xiaolei Bai and Jianhui Duan

In the field of cloth animation, the collision detection of fabric under external force is very complex, and it is difficult to satisfy the requirements of realism and real-time performance. The purpose…

Abstract

Purpose

In the field of cloth animation, the collision detection of fabric under external force is very complex, and it is difficult to satisfy the requirements of realism and real-time performance. The purpose of this paper is to improve realism and meet the real-time requirement.

Design/methodology/approach

This paper puts forward a mass-spring model that builds a bounding box at the center of each particle and designs a collision detection algorithm based on MapReduce. At the same time, a method is proposed to detect collisions based on geometric units.
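
The sketch below illustrates the broad-phase idea of a bounding box built around each particle: a cheap axis-aligned overlap test against a triangle's bounding box culls pairs before any exact intersection test or collision response. In a MapReduce setting, each mapper could run this test over its own subset of particle-triangle pairs. The box half-extent and data layout are assumptions, not the paper's values.

def particle_aabb(p, half_extent=0.05):
    """Axis-aligned box centred on a particle position (x, y, z)."""
    return tuple(c - half_extent for c in p), tuple(c + half_extent for c in p)

def triangle_aabb(tri):
    """Axis-aligned box enclosing a triangle given as three vertices."""
    return (tuple(min(v[i] for v in tri) for i in range(3)),
            tuple(max(v[i] for v in tri) for i in range(3)))

def aabb_overlap(box_a, box_b):
    """Boxes overlap iff their intervals overlap on every axis."""
    (alo, ahi), (blo, bhi) = box_a, box_b
    return all(alo[i] <= bhi[i] and blo[i] <= ahi[i] for i in range(3))

particle = (0.0, 0.0, 0.0)
triangle = [(0.02, -0.1, -0.1), (0.02, 0.1, -0.1), (0.02, 0.0, 0.1)]
# Only pairs whose boxes overlap proceed to the exact narrow-phase test.
print(aabb_overlap(particle_aabb(particle), triangle_aabb(triangle)))  # -> True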

Findings

The method can quickly detect the intersection of a particle and a triangle and then handle the collision response according to the physical characteristics of the fabric. Experiments show that the algorithm improves real-time performance and realism.

Research limitations/implications

Experiments show that 3D fabric simulation can be made more efficient through the MapReduce parallel computation model.

Practical implications

This method can improve realism and reduce the amount of computation.

Social implications

This collision detection method can be applied in further fields such as 3D games, aviation simulation training and garment automation.

Originality/value

The model and method are original and can be applied to 3D animation, digital entertainment and the garment industry.

Details

International Journal of Clothing Science and Technology, vol. 27 no. 6
Type: Research Article
ISSN: 0955-6222

Article
Publication date: 14 August 2017

Neha Verma and Jatinder Singh

The purpose of this paper is to explore various limitations of conventional mining systems in extracting useful buying patterns from retail transactional databases flooded with…

Abstract

Purpose

The purpose of this paper is to explore various limitations of conventional mining systems in extracting useful buying patterns from retail transactional databases flooded with Big Data. The key objective is to assist retail business owners to better understand the purchase needs of their customers and hence to attract customers to physical retail stores away from competitor e-commerce websites.

Design/methodology/approach

This paper employs a systematic, category-based review of the relevant literature to explore the challenges posed by Big Data for the retail industry, followed by the discussion and implementation of the association between MapReduce-based Apriori association mining and a Hadoop-based intelligent cloud architecture.
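
As a generic illustration of how one Apriori pass maps onto MapReduce, the sketch below counts candidate k-itemsets per transaction chunk in the mapper and sums and filters them by minimum support in the reducer. It is not the authors' MR-Apriori/IRM implementation; the transactions and threshold are made up.

from collections import Counter
from itertools import combinations

def map_count(transactions_chunk, k):
    """Mapper: emit (itemset, count) for every k-itemset in each transaction."""
    counts = Counter()
    for txn in transactions_chunk:
        for itemset in combinations(sorted(set(txn)), k):
            counts[itemset] += 1
    return counts.items()

def reduce_support(emitted, min_support):
    """Reducer: sum partial counts and keep only frequent itemsets."""
    totals = Counter()
    for itemset, count in emitted:
        totals[itemset] += count
    return {s: c for s, c in totals.items() if c >= min_support}

chunks = [
    [["milk", "bread"], ["milk", "butter", "bread"]],
    [["bread", "butter"], ["milk", "bread"]],
]
emitted = [pair for chunk in chunks for pair in map_count(chunk, k=2)]
print(reduce_support(emitted, min_support=2))
# e.g. {('bread', 'milk'): 3, ('bread', 'butter'): 2}

Each pass over candidate sizes k requires another MapReduce job, which is the iteration cost that cloud-based frameworks are meant to absorb.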

Findings

The findings reveal that conventional mining algorithms have not evolved to support the Big Data analysis required by modern retail businesses; they require substantial resources such as memory and computational engines. This study develops the MR-Apriori algorithm in the form of an IRM tool to address these issues in an efficient manner.

Research limitations/implications

The paper suggests that much research remains to be done in market basket analysis if the full potential of cloud-based Big Data frameworks is to be utilized.

Originality/value

This research arms retail business owners with an innovative IRM tool for easily extracting comprehensive knowledge of customers' buying patterns to increase profits. The study experimentally verifies the effectiveness of the proposed algorithm.

Details

Industrial Management & Data Systems, vol. 117 no. 7
Type: Research Article
ISSN: 0263-5577

Article
Publication date: 6 August 2021

Alexander Döschl, Max-Emanuel Keller and Peter Mandl

This paper aims to evaluate different approaches for the parallelization of compute-intensive tasks. The study compares a Java multi-threaded algorithm, distributed computing…

Abstract

Purpose

This paper aims to evaluate different approaches for the parallelization of compute-intensive tasks. The study compares a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and resilient distributed data set (RDD) (Apache Spark) paradigms and a graphics processing unit (GPU) approach with Numba for compute unified device architecture (CUDA).

Design/methodology/approach

The paper uses a simple but computationally intensive puzzle as a case study for the experiments. To find all solutions by brute-force search, 15! permutations had to be computed and tested against the solution rules. The experimental application comprises a Java multi-threaded algorithm, distributed computing solutions with the MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms and a GPU approach with Numba for CUDA. The implementations were benchmarked on Amazon EC2 instances for performance and scalability measurements.
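
Such a brute-force search decomposes naturally into independent tasks, for example one per choice of first element, which is what makes it amenable to Hadoop mappers, Spark partitions, Java threads or CUDA blocks alike. The sketch below shows this decomposition on a much smaller instance (6! rather than 15!) with a made-up solution rule; it is not the paper's puzzle.

from itertools import permutations

ITEMS = list(range(6))  # 6! here instead of 15! to keep the example fast

def is_solution(perm):
    """Hypothetical rule: neighbouring values never differ by exactly 1."""
    return all(abs(a - b) != 1 for a, b in zip(perm, perm[1:]))

def search_branch(first):
    """One independent task: test all permutations starting with `first`."""
    rest = [x for x in ITEMS if x != first]
    return [(first,) + p for p in permutations(rest) if is_solution((first,) + p)]

# Each branch is an independent unit of work; concatenating the per-branch
# results is the "reduce" step.
solutions = [s for first in ITEMS for s in search_branch(first)]
print(len(solutions))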

Findings

The comparison of the solutions with Apache Hadoop and Apache Spark under Amazon EMR showed that the processing time measured in CPU minutes with Spark was up to 30% lower, while the performance of Spark especially benefits from an increasing number of tasks. With the CUDA implementation, more than 16 times faster execution is achievable for the same price compared to the Spark solution. Apart from the multi-threaded implementation, the processing times of all solutions scale approximately linearly. Finally, several application suggestions for the different parallelization approaches are derived from the insights of this study.

Originality/value

There are numerous studies that have examined the performance of parallelization approaches. Most of these studies deal with processing large amounts of data or mathematical problems. This work, in contrast, compares these technologies on their ability to implement computationally intensive distributed algorithms.

Details

International Journal of Web Information Systems, vol. 17 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 20 August 2019

Sandhya N., Philip Samuel and Mariamma Chacko

Telecommunication has a decisive role in the development of technology in the current era. The number of mobile users with multiple SIM cards is increasing every second. Hence…

Abstract

Purpose

Telecommunication has a decisive role in the development of technology in the current era, and the number of mobile users with multiple SIM cards is increasing every second; hence, telecommunication is a significant area in which big data technologies are needed. Competition among telecommunication companies is high due to customer churn, and customer retention is one of their major problems. The paper aims to discuss this issue.

Design/methodology/approach

The authors recommend an Intersection-Randomized Algorithm (IRA) using MapReduce functions to avoid data duplication in the mobile user call data of telecommunication service providers. The authors use an agent-based model (ABM) to predict complex mobile user behaviour in order to prevent customer churn with a particular telecommunication service provider.
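
The de-duplication step lends itself to the usual MapReduce pattern of keying records and keeping one representative per key, as in the sketch below. The exact Intersection-Randomized Algorithm is not reproduced here, and the call-record fields are hypothetical.

from collections import defaultdict

def map_cdr(record):
    """Mapper: key each call detail record by the fields that identify a call."""
    key = (record["caller"], record["callee"], record["start_time"])
    return [(key, record)]

def reduce_dedup(records):
    """Reducer: keep a single representative record per key."""
    return records[0]

cdrs = [
    {"caller": "A", "callee": "B", "start_time": "2019-01-01T10:00", "cell": 7},
    {"caller": "A", "callee": "B", "start_time": "2019-01-01T10:00", "cell": 7},
    {"caller": "A", "callee": "C", "start_time": "2019-01-01T11:30", "cell": 3},
]
grouped = defaultdict(list)
for rec in cdrs:
    for key, value in map_cdr(rec):
        grouped[key].append(value)
deduplicated = [reduce_dedup(v) for v in grouped.values()]
print(len(deduplicated))  # -> 2

The cleaned call records would then feed the agent-based model, which operates on user-level behaviour rather than raw duplicated records.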

Findings

The agent-based model increases the prediction accuracy due to the dynamic nature of agents. ABM suggests rules based on mobile user variable features using multiple agents.

Research limitations/implications

The authors have not considered the microscopic behaviour of customer churn arising from complex user behaviour.

Practical implications

This paper shows the effectiveness of the IRA along with the agent-based model for predicting mobile user churn behaviour. The advantages of the proposed model are that the user churn prediction system is straightforward, cost-effective, flexible and distributed, and it delivers good business profit.

Originality/value

This paper shows customer churn prediction from complex human behaviour in an effective and flexible manner in a distributed environment, using the Intersection-Randomized MapReduce Algorithm together with an agent-based model.

Details

Data Technologies and Applications, vol. 53 no. 3
Type: Research Article
ISSN: 2514-9288

Open Access
Article
Publication date: 12 March 2018

Hafiz A. Alaka, Lukumon O. Oyedele, Hakeem A. Owolabi, Muhammad Bilal, Saheed O. Ajayi and Olugbenga O. Akinade

This study explored the use of big data analytics (BDA) to analyse data from a large number of construction firms to develop a construction business failure prediction model (CB-FPM)…

Abstract

This study explored the use of big data analytics (BDA) to analyse data from a large number of construction firms in order to develop a construction business failure prediction model (CB-FPM). Careful analysis of the literature revealed financial ratios to be the best form of variable for this problem. Because of MapReduce's unsuitability for the iterative problems involved in developing CB-FPMs, various BDA initiatives for iterative problems were identified. A BDA framework for developing the CB-FPM was proposed and validated using 150,000 data cells from 30,000 construction firms, an artificial neural network, Amazon Elastic Compute Cloud, Apache Spark and the R software. The BDA CB-FPM was developed in eight seconds, while the same process without BDA was aborted after nine hours without success. This shows that the reluctance to use large datasets for developing CB-FPMs because of the tedious duration involved can be resolved by applying BDA techniques. The BDA CB-FPM largely outperformed an ordinary CB-FPM developed with a dataset of 200 construction firms, showing that using a larger sample size with the aid of BDA leads to better-performing CB-FPMs. The high financial and social cost associated with misclassifications (i.e. model error) thus makes the adoption of BDA CB-FPMs very important for, among others, financiers, clients and policy makers.

Details

Applied Computing and Informatics, vol. 16 no. 1/2
Type: Research Article
ISSN: 2634-1964

Article
Publication date: 21 May 2018

Suganeshwari G., Syed Ibrahim S.P. and Gang Li

The purpose of this paper is to address the scalability issue and produce high-quality recommendations that best match the user's current preference in the dynamically growing…

Abstract

Purpose

The purpose of this paper is to address the scalability issue and produce high-quality recommendations that best match the user's current preference in dynamically growing datasets, in the context of memory-based collaborative filtering methods using temporal information.

Design/methodology/approach

The proposed method is formalized as a time-dependent collaborative filtering method. For each item, a set of influential neighbors is identified using a truncated version of the similarity computation based on the timestamp. Then, the n most recent transactions are used to generate recommendations that reflect the recent preference of the active user. The proposed method, lazy collaborative filtering with dynamic neighborhoods (LCFDN), is further scaled up by implementing it in Spark using the parallel processing paradigm MapReduce. Experiments conducted on the MovieLens dataset reveal that LCFDN implemented on MapReduce is more efficient and achieves better performance than the existing methods.
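
The sketch below illustrates the two ingredients under assumed data structures: a similarity that only counts co-ratings falling within a time window (standing in for the timestamp-truncated similarity) and recommendations driven by the active user's n most recent transactions. It is not the LCFDN algorithm or its Spark implementation.

from collections import defaultdict

# ratings: (user, item, rating, timestamp)
ratings = [
    ("u1", "i1", 5, 100), ("u1", "i2", 4, 105),
    ("u2", "i1", 4, 200), ("u2", "i2", 5, 202), ("u2", "i3", 3, 300),
    ("u3", "i2", 5, 400), ("u3", "i3", 4, 401),
]

def truncated_similarity(item_a, item_b, window=10):
    """Count users who rated both items within `window` time units."""
    times_a = {u: t for u, i, _, t in ratings if i == item_a}
    times_b = {u: t for u, i, _, t in ratings if i == item_b}
    return sum(1 for u in times_a
               if u in times_b and abs(times_a[u] - times_b[u]) <= window)

def recommend(user, n_recent=2):
    """Score unseen items against the user's n most recent transactions."""
    history = sorted((t, i) for u, i, _, t in ratings if u == user)[-n_recent:]
    seen = {i for u, i, _, _ in ratings if u == user}
    items = {i for _, i, _, _ in ratings} - seen
    scores = {c: sum(truncated_similarity(i, c) for _, i in history) for c in items}
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("u1"))  # -> ['i3'], ranked by time-truncated similarity

Restricting both the neighborhood (via the time window) and the user history (via n recent transactions) is what keeps the computation small enough to distribute efficiently.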

Findings

The results of the experimental study clearly show that not all ratings provide valuable information. A recommendation system based on LCFDN increases the efficiency of predictions by selecting the most influential neighbors based on temporal information. The pruning to the user's recent transactions also addresses the user's preference drifts and is more scalable when compared to state-of-the-art methods.

Research limitations/implications

In the proposed method, LCFDN, the neighborhood space is dynamically adjusted based on the temporal information. In addition, LCFDN determines the user's current interest based on recent preference or purchase details. The method is designed to continuously track the user's preference as the dataset grows, which makes it suitable for the e-commerce industry. Compared with state-of-the-art methods, it provides high-quality recommendations with good efficiency.

Originality/value

LCFDN is an extension of collaborative filtering with temporal information used as context. The dynamic nature of the data and the user's preference drifts are addressed by dynamically adapting the neighbors. To improve scalability, the proposed method is implemented in a big data environment using MapReduce. The proposed recommendation system provides greater prediction accuracy than traditional recommender systems.

Details

Information Discovery and Delivery, vol. 46 no. 2
Type: Research Article
ISSN: 2398-6247

Open Access
Article
Publication date: 3 August 2020

Maryam AlJame and Imtiaz Ahmad

The evolution of technologies has unleashed a wealth of challenges by generating massive amount of data. Recently, biological data has increased exponentially, which has…

Abstract

The evolution of technologies has unleashed a wealth of challenges by generating massive amounts of data. Recently, biological data has increased exponentially, which has introduced several computational challenges. DNA short read alignment is an important problem in bioinformatics, and the exponential growth in the number of short reads has increased the need for an ideal platform to accelerate the alignment process. Apache Spark is a cluster-computing framework that provides data parallelism and fault tolerance. In this article, we propose a Spark-based algorithm, called Spark-DNAligning, to accelerate the DNA short read alignment problem. Spark-DNAligning exploits Apache Spark's performance optimizations such as broadcast variables, join after partitioning, caching and in-memory computations. Spark-DNAligning is evaluated in terms of performance by comparing it with the SparkBWA tool and a MapReduce-based algorithm called CloudBurst. All the experiments are conducted on Amazon Web Services (AWS). Results demonstrate that Spark-DNAligning outperforms both tools by providing a speedup in the range of 101–702 in aligning gigabytes of short reads to the human genome. Empirical evaluation reveals that Apache Spark offers promising solutions to the DNA short read alignment problem.
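
The broadcast-variable optimization mentioned above can be illustrated with a small PySpark sketch: a compact k-mer index of the reference is broadcast to every executor so that seed lookup becomes a map-side join. This only illustrates the Spark pattern, not the Spark-DNAligning algorithm; the toy reference, k-mer size and exact-seed rule are assumptions.

from pyspark import SparkContext

sc = SparkContext("local[*]", "broadcast-join-sketch")

reference = "ACGTACGGACGT"
K = 4
# k-mer -> list of positions in the reference (small enough to broadcast)
index = {}
for pos in range(len(reference) - K + 1):
    index.setdefault(reference[pos:pos + K], []).append(pos)
index_bc = sc.broadcast(index)

reads = sc.parallelize(["ACGG", "TTTT", "ACGT"])

def seed_positions(read):
    """Map-side lookup against the broadcast index (exact k-mer seeds only)."""
    return read, index_bc.value.get(read[:K], [])

print(reads.map(seed_positions).collect())
# e.g. [('ACGG', [4]), ('TTTT', []), ('ACGT', [0, 8])]
sc.stop()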

Details

Applied Computing and Informatics, vol. 19 no. 1/2
Type: Research Article
ISSN: 2634-1964
