Search results

1 – 10 of 105
Article
Publication date: 21 December 2021

Laouni Djafri

This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P…


Abstract

Purpose

This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.

Design/methodology/approach

In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors' proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors are raising in this work is how to make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by the Map-Reduce algorithm, which in turn depends on a random sampling technique. So, the distributed architecture that the authors designed is specially directed to handle big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2). The experimental results show the efficiency of the authors' solution without significant loss in the classification results. Thus, in practical terms, the DDPML system is generally dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
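As an illustration only (not the authors' DDPML implementation), the following minimal Python sketch shows the two-level stratified random sampling idea the abstract describes: each distributed partition is sampled per class, and the merged result is sampled again to form a compact representative base. The function names, sampling fractions and toy data are assumptions for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw the same fraction from every class (stratified random sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    sample = []
    for items in by_class.values():
        k = max(1, round(fraction * len(items)))
        sample.extend(rng.sample(items, k))
    return sample

def two_level_sample(partitions, label_of, level1=0.2, level2=0.5):
    """Level 1: sample each distributed partition (partial bases);
    level 2: sample the merged result into a representative base."""
    level1_samples = [stratified_sample(p, label_of, level1) for p in partitions]
    merged = [rec for part in level1_samples for rec in part]
    return stratified_sample(merged, label_of, level2)

if __name__ == "__main__":
    data = [(i, "pos" if i % 3 == 0 else "neg") for i in range(3000)]
    partitions = [data[i::4] for i in range(4)]   # mimic 4 distributed splits
    rlb = two_level_sample(partitions, label_of=lambda r: r[1])
    print(len(rlb), "records in the representative sample")
```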

Findings

The authors got very satisfactory classification results.

Originality/value

The DDPML system is specially designed to smoothly handle big data mining classification.

Details

Data Technologies and Applications, vol. 56 no. 4
Type: Research Article
ISSN: 2514-9288


Article
Publication date: 4 April 2016

Ilija Subasic, Nebojsa Gvozdenovic and Kris Jack

The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from a crowd-sourced data, demonstrate…

Abstract

Purpose

The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from crowd-sourced data, demonstrate how to learn an optimal combination of distance metrics for duplicate detection and introduce a parallel duplicate clustering algorithm.

Design/methodology/approach

The authors developed the algorithm and compared it with state-of-the-art systems tackling the same problem. The authors used benchmark data sets (3k data points) to test the effectiveness of the algorithm and a real-life data set (>90 million data points) to test its efficiency and scalability.

Findings

The authors show that duplicate detection can be improved by an additional step they call duplicate clustering. The authors also show how to improve the efficiency of the map/reduce similarity calculation algorithm by introducing a sampling step. Finally, the authors find that the system is comparable to the state-of-the-art systems for duplicate detection, and that it can scale to deal with hundreds of millions of data points.

Research limitations/implications

Academic researchers can use this paper to understand some of the issues of transitivity in duplicate detection, and its effects on digital catalogue generation.

Practical implications

Industry practitioners can use this paper as a case study for building a large-scale, real-life catalogue generation system that deals with millions of records in a scalable and efficient way.

Originality/value

In contrast to other similarity calculation algorithms developed for m/r frameworks, the authors present a specific variant of similarity calculation that is optimized for duplicate detection of bibliographic records by extending the previously proposed e-algorithm based on inverted index creation. In addition, the authors are concerned with more than duplicate detection and investigate how to group detected duplicates. The authors develop distinct algorithms for duplicate detection and duplicate clustering and use the canopy clustering idea for multi-pass clustering. The work extends the current state-of-the-art by including the duplicate clustering step and demonstrates new strategies for speeding up m/r similarity calculations.
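As a reference point only, the sketch below shows the classic canopy clustering idea mentioned above (a cheap distance groups candidate duplicates, and items can fall into several canopies for a later multi-pass step). It is not the authors' map/reduce implementation; the token-based distance, thresholds and sample citations are illustrative assumptions.

```python
def jaccard_distance(a, b):
    """Cheap distance between two token sets (1 - Jaccard similarity)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def canopy_clusters(records, distance, t1=0.6, t2=0.3):
    """Canopy clustering: points within t1 of a centre join its canopy;
    only points within t2 are removed from further consideration (t2 < t1)."""
    remaining = list(records)
    canopies = []
    while remaining:
        centre = remaining.pop(0)
        canopy, keep = [centre], []
        for rec in remaining:
            d = distance(centre, rec)
            if d < t1:
                canopy.append(rec)   # candidate duplicate of the centre
            if d >= t2:
                keep.append(rec)     # may still join other canopies
        remaining = keep
        canopies.append(canopy)
    return canopies

citations = [
    "large scale duplicate detection of bibliographic records".split(),
    "duplicate detection for bibliographic records at scale".split(),
    "grey wolf optimization for antenna design".split(),
]
print(canopy_clusters(citations, jaccard_distance))
```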

Details

Program, vol. 50 no. 2
Type: Research Article
ISSN: 0033-0337


Article
Publication date: 11 May 2015

Alejandro Vera-Baquero, Ricardo Colomo Palacios, Vladimir Stantchev and Owen Molloy

This paper aims to present a solution that enables organizations to monitor and analyse the performance of their business processes by means of Big Data technology. Business…


Abstract

Purpose

This paper aims to present a solution that enables organizations to monitor and analyse the performance of their business processes by means of Big Data technology. Business process improvement can drastically influence the profit of corporations and help them remain viable. However, the use of traditional Business Intelligence systems is not sufficient to meet today's business needs. They are normally business domain-specific and have not been sufficiently process-aware to support the needs of process improvement-type activities, especially on large and complex supply chains, where improvement entails integrating, monitoring and analysing a vast amount of dispersed, unstructured event logs produced on a variety of heterogeneous environments. This paper tackles this variability by devising different Big-Data-based approaches that aim to gain visibility into process performance.

Design/methodology/approach

The authors present a cloud-based solution that leverages Big Data (BD) technology to provide essential insights into business process improvement. The proposed solution is aimed at measuring and improving overall business performance, especially in very large and complex cross-organisational business processes, where this type of visibility is hard to achieve across heterogeneous systems.

Findings

Three different BD approaches have been undertaken based on Hadoop and HBase. The authors first introduce a map-reduce approach that is suitable for batch processing and offers very high scalability. Secondly, they describe an alternative solution that integrates the proposed system with Impala. This approach improves significantly on map-reduce, as it focuses on performing real-time queries over HBase. Finally, the use of secondary indexes is also proposed with the aim of enabling immediate access to event instances for correlation, at the cost of high storage duplication and synchronization issues. This approach has produced remarkable results in two real functional environments presented in the paper.
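The secondary-index idea described above can be sketched with plain dictionaries standing in for HBase tables; this is a conceptual illustration only, not the authors' system, and the row keys, field names and sample events are assumptions.

```python
# Conceptual sketch only: dicts stand in for an HBase-style key-value store.
event_store = {}        # primary table: row_key -> event record
correlation_index = {}  # secondary index: correlation_id -> set of row keys

def put_event(row_key, event):
    event_store[row_key] = event
    # Duplicate the correlation key into the index so correlation queries
    # avoid a full scan of the primary table (the storage cost of the index).
    correlation_index.setdefault(event["correlation_id"], set()).add(row_key)

def events_for(correlation_id):
    """Immediate access to all event instances sharing a correlation key."""
    return [event_store[k] for k in correlation_index.get(correlation_id, ())]

put_event("evt#001", {"correlation_id": "order-42", "activity": "ship"})
put_event("evt#002", {"correlation_id": "order-42", "activity": "invoice"})
print(events_for("order-42"))
```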

Originality/value

The value of the contribution lies in the comparison and integration of software packages towards an integrated solution intended for adoption by industry. Apart from that, in this paper the authors illustrate the deployment of the architecture in two different settings.

Details

The Learning Organization, vol. 22 no. 4
Type: Research Article
ISSN: 0969-6474


Book part
Publication date: 19 July 2022

Ayesha Banu

Introduction: The Internet has tremendously transformed the computer and networking world. Information reaches our fingertips and adds data to our repository within a second. Big…

Abstract

Introduction: The Internet has tremendously transformed the computer and networking world. Information reaches our fingertips and adds data to our repository within a second. Big data was initially defined by three Vs, where data come with greater variety, increasing volume and higher velocity. Big data is a collection of structured, unstructured and semi-structured data gathered from different sources and applications. It has become the most powerful buzzword in almost all business sectors. The real success of any industry can be judged by how big data is analysed, potential knowledge is discovered and productive business decisions are made. New technologies such as artificial intelligence and machine learning have added more efficiency to storing and analysing data. Big data analytics (BDA) becomes more valuable to those companies focusing on getting insight into customer behaviour, trends and patterns. This popularity of big data has inspired insurance companies to utilise big data in their core systems and advance financial operations, improve customer service, construct a personalised environment and take all possible measures to increase revenue and profits.

Purpose: This study aims to recognise what big data stands for in the insurance sector and how the application of BDA has opened the door for new and innovative changes in the insurance industry.

Methodology: This study describes the field of BDA in the insurance sector, discusses its benefits, outlines the tools, architectural framework and method, describes applications both in general and in specific contexts, and briefly discusses the opportunities and challenges.

Findings: The study concludes that BDA in insurance is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs. Its potential is great; however, there remain challenges to overcome.

Details

Big Data: A Game Changer for Insurance Industry
Type: Book
ISBN: 978-1-80262-606-3


Article
Publication date: 20 August 2018

Laouni Djafri, Djamel Amar Bensaber and Reda Adjoudj

This paper aims to solve the problems of big data analytics for prediction including volume, veracity and velocity by improving the prediction result to an acceptable level and in…

Abstract

Purpose

This paper aims to solve the problems of big data analytics for prediction, including volume, veracity and velocity, by improving the prediction result to an acceptable level and in the shortest possible time.

Design/methodology/approach

This paper is divided into two parts. The first one aims to improve the result of the prediction. In this part, two ideas are proposed: the double pruning enhanced random forest algorithm and extracting a shared learning base from the stratified random sampling method to obtain a representative learning base of all the original data. The second part proposes a distributed architecture supported by new technology solutions, which in turn works in a coherent and efficient way with the sampling strategy under the supervision of the Map-Reduce algorithm.
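The double pruning enhanced random forest is the authors' own algorithm and is not reproduced here; as a rough stand-in, the sketch below combines a stratified split (class proportions preserved, in the spirit of the shared learning base) with a cost-complexity-pruned random forest. It assumes scikit-learn is available, and the synthetic data and parameter values are illustrative.

```python
# Not the authors' double-pruning algorithm: a hedged stand-in using
# scikit-learn's cost-complexity pruning on a stratified split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           random_state=0)
# Stratified split keeps class proportions in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, ccp_alpha=0.001,
                                random_state=0)
forest.fit(X_train, y_train)
print("held-out accuracy:", forest.score(X_test, y_test))
```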

Findings

The representative learning base obtained by the integration of two learning bases, the partial base and the shared base, presents an excellent representation of the original data set and gives very good results for Big Data predictive analytics. Furthermore, these results were supported by the improved random forests supervised learning method, which played a key role in this context.

Originality/value

All companies are concerned, especially those with large amounts of information that they want to screen to improve their knowledge of the customer and optimize their campaigns.

Details

Information Discovery and Delivery, vol. 46 no. 3
Type: Research Article
ISSN: 2398-6247


Article
Publication date: 20 November 2017

Thushari Silva and Jian Ma

Expert profiling plays an important role in expert finding for collaborative innovation in research social networking platforms. Dynamic changes in scientific knowledge have posed…


Abstract

Purpose

Expert profiling plays an important role in expert finding for collaborative innovation in research social networking platforms. Dynamic changes in scientific knowledge have posed significant challenges to expert profiling. Current approaches mostly rely on knowledge of other experts, contents of static web pages or their behavior, and thus overlook the insights from big social data generated through crowdsourcing in research social networks and scientific data sources. In light of this deficiency, this research proposes a big data-based approach that harnesses the collective intelligence of the crowd in (research) social networking platforms and scientific databases for expert profiling.

Design/methodology/approach

A big data analytics approach which uses crowdsourcing is designed and developed for expert profiling. The proposed approach interconnects big data sources covering publication data, project data and data from social networks (i.e. posts, updates and endorsements collected through crowdsourcing). A large volume of structured data representing scientific knowledge is available in Web of Science, Scopus, CNKI and the ACM digital library; these are considered publication data in this research context. Project data are located in the databases hosted by funding agencies. The authors follow the Map-Reduce strategy to extract real-time data from all these sources. Two main steps, features mining and profile consolidation (the details of which are outlined in the manuscript), are followed to generate comprehensive user profiles. The major tasks included in features mining are processing of big data sources to extract representational features of profiles, entity-profile generation and social-profile generation through crowd-opinion mining. In profile consolidation, the two profiles, namely the entity-profile and the social-profile, are conflated.
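A minimal sketch of the profile consolidation step only (not the ScholarMate implementation): an entity-profile mined from publications and projects is conflated with a social-profile mined from crowd endorsements. The 0.7/0.3 weights, topic names and scores are assumptions for the example.

```python
# Illustrative only: conflate an entity-profile with a social-profile into
# one weighted expertise profile. Weights and topics are assumptions.
def consolidate(entity_profile, social_profile, w_entity=0.7, w_social=0.3):
    topics = set(entity_profile) | set(social_profile)
    return {t: w_entity * entity_profile.get(t, 0.0)
               + w_social * social_profile.get(t, 0.0)
            for t in topics}

entity_profile = {"recommender systems": 0.9, "big data": 0.6}
social_profile = {"big data": 0.8, "crowdsourcing": 0.5}
expert_profile = consolidate(entity_profile, social_profile)
print(sorted(expert_profile.items(), key=lambda kv: -kv[1]))
```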

Findings

(1) The integration of crowdsourcing techniques with big research data analytics has improved the graded relevance of the constructed profiles. (2) A system to construct experts' profiles based on the proposed methods has been incorporated into an operational system called ScholarMate (www.scholarmate.com).

Research limitations

One shortcoming is that the experiments were conducted using a sampling strategy. In the future, the authors will perform controlled large-scale experiments and field tests to validate and comprehensively evaluate the design artifacts.

Practical implications

The business implication of this research work is that the developed methods and the system can be applied to streamline human capital management in organizations.

Originality/value

The proposed approach interconnects opinions of crowds on one’s expertise with corresponding expertise demonstrated in scientific knowledge bases to construct comprehensive profiles. This is a novel approach which alleviates problems associated with existing methods. The authors’ team has developed an expert profiling system operational in ScholarMate research social network (www.scholarmate.com), which is a professional research social network that connects people to research with the aim of “innovating smarter” and was launched in 2007.

Details

Information Discovery and Delivery, vol. 45 no. 4
Type: Research Article
ISSN: 2398-6247


Article
Publication date: 19 February 2021

C. Lakshmi and K. Usha Rani

This paper proposes a resilient distributed processing technique (RDPT), in which the mapper and reducer are simplified with Spark contexts to support distributed parallel query processing.

Abstract

Purpose

This paper proposes a resilient distributed processing technique (RDPT), in which the mapper and reducer are simplified with Spark contexts to support distributed parallel query processing.

Design/methodology/approach

The proposed work is implemented in Pig Latin with Spark contexts to develop query processing in a distributed environment.

Findings

Query processing in Hadoop relies on distributed processing with the MapReduce model. MapReduce distributes the work across different nodes through the implementation of complex mappers and reducers. Its results are valid only up to a certain size of data.

Originality/value

Pig supports the required parallel processing framework with the following constructs during query processing: FOREACH, FLATTEN and COGROUP.
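For readers more familiar with Spark's RDD API, the sketch below shows rough PySpark counterparts of the Pig constructs named above (FOREACH as map, FLATTEN as flatMap, COGROUP as cogroup). It assumes a local Spark installation and invented sample data, and is not the authors' implementation.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "pig-like-constructs")

orders = sc.parallelize([("u1", ["book", "pen"]), ("u2", ["lamp"])])
users  = sc.parallelize([("u1", "Alice"), ("u2", "Bob")])

# FOREACH ... GENERATE  ->  map: project/transform each record
upper = orders.map(lambda kv: (kv[0], [item.upper() for item in kv[1]]))

# FLATTEN               ->  flatMap: one output record per bag element
flat = orders.flatMap(lambda kv: [(kv[0], item) for item in kv[1]])

# COGROUP               ->  cogroup: group two relations by key
grouped = users.cogroup(flat)

for key, (names, items) in grouped.collect():
    print(key, list(names), list(items))

sc.stop()
```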

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X


Article
Publication date: 15 January 2020

Ramakrishna Guttula and Venkateswara Rao Nandanavanam

Microstrip patch antenna is generally used for several communication purposes particularly in the military and civilian applications. Even though several techniques have been made…

Abstract

Purpose

The microstrip patch antenna is generally used for several communication purposes, particularly in military and civilian applications. Even though several techniques have made numerous achievements in several fields, some systems require additional improvements to meet a few remaining challenges. In particular, they require application-specific improvement for optimally designing the microstrip patch antenna. The paper aims to discuss these issues.

Design/methodology/approach

This paper intends to adopt an advanced meta-heuristic search algorithm called grey wolf optimization (GWO), which is said to be inspired by the hunting behaviour of grey wolves, for the design of the patch antenna parameters. The search for the optimal design of the antenna is sped up using an opposition-based solution search. Moreover, the proposed model derives a nonlinear objective model to aid the design of the solution space of antenna parameters. After executing the simulation model, this paper compares the performance of the proposed GWO-based microstrip patch antenna with several conventional models.
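For orientation, a minimal generic grey wolf optimization loop is sketched below; it shows the alpha/beta/delta position update but omits the opposition-based search and uses a placeholder objective instead of the antenna design model, so it is illustrative only.

```python
import random

def gwo(objective, dim, bounds, n_wolves=20, iters=100, seed=0):
    """Generic grey wolf optimizer: positions move toward the three best
    wolves (alpha, beta, delta) while coefficient a decays from 2 to 0."""
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=objective)
        alpha, beta, delta = (w[:] for w in wolves[:3])   # freeze the leaders
        a = 2 - 2 * t / iters
        for i, w in enumerate(wolves):
            new_pos = []
            for d in range(dim):
                x = 0.0
                for leader in (alpha, beta, delta):
                    r1, r2 = rng.random(), rng.random()
                    A, C = 2 * a * r1 - a, 2 * r2
                    x += leader[d] - A * abs(C * leader[d] - w[d])
                new_pos.append(min(hi, max(lo, x / 3)))   # average of 3 pulls
            wolves[i] = new_pos
    return min(wolves, key=objective)

# Placeholder objective (sphere function), not the antenna design model.
best = gwo(lambda v: sum(x * x for x in v), dim=4, bounds=(-5.0, 5.0))
print("best solution:", best)
```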

Findings

The gain of the proposed model is 27.05 per cent better than the WOAD model, 2.07 per cent better than AAD, 15.80 per cent better than GAD, 17.49 per cent better than PSAD and 3.77 per cent better than GWAD. Thus, it is shown that the proposed antenna model attains a high gain, which leads to superior performance.

Originality/value

This paper presents a technique for designing the microstrip patch antenna using the proposed GWO algorithm. This is the first work that utilizes GWO-based optimization for the microstrip patch antenna.

Details

Data Technologies and Applications, vol. 54 no. 1
Type: Research Article
ISSN: 2514-9288


Article
Publication date: 3 December 2018

Cong-Phuoc Phan, Hong-Quang Nguyen and Tan-Tai Nguyen

Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industries. To maximally exploit its…

Abstract

Purpose

Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industries. To maximally exploit their potential, searching these patent documents has increasingly become an important topic. Although much research has processed large collections, few studies have attempted to integrate both patent classifications and specifications for analyzing user queries. Consequently, the queries are often insufficiently analyzed for improving the accuracy of search results. This paper aims to address this limitation by exploiting semantic relationships between patent contents and their classification.

Design/methodology/approach

The contributions are fourfold. First, the authors enhance similarity measurement between two short sentences and make it 20 per cent more accurate. Second, the Graph-embedded Tree ontology is enriched by integrating both patent documents and the classification scheme. Third, the ontology does not rely on a rule-based method or text matching; instead, a heuristic meaning comparison is applied to extract semantic relationships between concepts. Finally, the patent search approach uses the ontology effectively, with the results sorted based on their most common order.
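The authors' enhanced short-sentence similarity measure is not reproduced here; for context only, the sketch below shows the kind of simple lexical baseline (cosine similarity over raw term counts) that such a semantic enhancement would be compared against. The sample sentences are invented.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(s1, s2):
    """Cosine similarity over raw term counts of two short sentences."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = (sqrt(sum(c * c for c in v1.values()))
            * sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("container tracking in logistics",
                        "tracking containers for logistics networks"))
```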

Findings

The experiment on searching 600 patent documents in the field of Logistics yields a 15 per cent improvement in terms of F-Measure when compared with traditional approaches.

Research limitations/implications

The research, however, still requires improvement, in that the terms and phrases extracted as nouns and noun phrases make less sense in some respects and thus might not result in high accuracy. The large collection of extracted relationships could be further optimized for conciseness. In addition, parallel processing such as Map-Reduce could be used to further improve the search processing performance.

Practical implications

The experimental results could be used for scientists and technologists to search for novel, non-obvious technologies in the patents.

Social implications

High-quality patent search results will reduce patent infringement.

Originality/value

The proposed ontology is semantically enriched by integrating both patent documents and their classification. This ontology facilitates the analysis of the user queries for enhancing the accuracy of the patent search results.

Details

International Journal of Web Information Systems, vol. 15 no. 3
Type: Research Article
ISSN: 1744-0084


Article
Publication date: 26 April 2022

Jianpeng Zhang and Mingwei Lin

The purpose of this paper is to make an overview of 6,618 publications of Apache Hadoop from 2008 to 2020 in order to provide a conclusive and comprehensive analysis for…

Abstract

Purpose

The purpose of this paper is to provide an overview of 6,618 publications on Apache Hadoop from 2008 to 2020 in order to provide a conclusive and comprehensive analysis for researchers in this field, as well as preliminary knowledge of Apache Hadoop for interested researchers.

Design/methodology/approach

This paper employs bibliometric analysis and visual analysis approaches to systematically study and analyze publications about Apache Hadoop in the Web of Science database, with the aid of visualization applications. Through the bibliometric analysis of the collected documents, this paper analyzes the main statistical characteristics and cooperation networks. Research themes, research hotspots and future development trends are also investigated through keyword analysis.
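As a small illustration of the keyword analysis step (not the authors' toolchain), the sketch below counts keyword frequencies and co-occurrences over invented publication records to surface candidate hotspots and keyword links.

```python
from collections import Counter
from itertools import combinations

# Invented sample records: one keyword list per publication.
publications = [
    ["hadoop", "mapreduce", "big data"],
    ["hadoop", "hdfs", "performance"],
    ["spark", "big data", "performance"],
]

frequency = Counter(kw for keywords in publications for kw in keywords)
cooccurrence = Counter(
    pair for keywords in publications
    for pair in combinations(sorted(set(keywords)), 2))

print(frequency.most_common(3))     # candidate research hotspots
print(cooccurrence.most_common(3))  # strongest keyword links
```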

Findings

Research on Apache Hadoop will remain a top priority in the future, and how to improve the performance of Apache Hadoop in the era of big data is one of the research hotspots.

Research limitations/implications

This paper makes a comprehensive analysis of Apache Hadoop with bibliometric methods, and it helps researchers quickly grasp the hot topics in this area.

Originality/value

This paper maps the structural characteristics of the publications in this field and summarizes the research hotspots and trends of recent years, aiming to clarify the field's development status and inspire new ideas for researchers.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 16 no. 1
Type: Research Article
ISSN: 1756-378X

