Search results
1 – 10 of 105
Abstract
Purpose
This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other framework. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach
In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can later be used for prediction. This knowledge thus becomes a great asset in companies' hands, and extracting it is precisely the objective of data mining. But with data and knowledge being produced at an ever faster pace, we now speak of Big Data mining. For this reason, the proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is therefore how to make machine learning algorithms work in a distributed and parallel way at the same time without losing classification accuracy. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture controlled by the Map-Reduce algorithm, which in turn relies on a random sampling technique. The distributed architecture they designed is specially directed at Big Data processing and operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also allows the authors to verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2).
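As an illustration of the two-level stratified sampling described above, the following is a minimal Python sketch (not the authors' implementation; the toy data, sampling fractions and function name are assumptions). Each class stratum contributes the same fraction of its records, so class proportions survive both sampling levels:

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw a stratified random sample: each class (stratum) contributes
    the same fraction of its records, preserving class proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[label_of(rec)].append(rec)
    sample = []
    for label, recs in strata.items():
        k = max(1, round(fraction * len(recs)))
        sample.extend(rng.sample(recs, k))
    return sample

# Toy data: (feature, class_label) pairs with a 3:1 class imbalance.
data = [(i, "A") for i in range(90)] + [(i, "B") for i in range(30)]

# Level 1: a worker node draws a partial learning base (PLBL1).
plbl1 = stratified_sample(data, label_of=lambda r: r[1], fraction=0.5, seed=1)
# Level 2: a smaller shared learning base (SLB) drawn from the level-1 sample.
slb = stratified_sample(plbl1, label_of=lambda r: r[1], fraction=0.2, seed=2)

print(len(plbl1), len(slb))  # class proportions are preserved at both levels
```

The point of stratifying at each level is that a rare class cannot vanish from the representative learning base, which a plain random sample could allow.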
The experimental results show the efficiency of the proposed solution, with no significant loss in classification quality. In practical terms, the DDPML system is dedicated to Big Data mining and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings
The authors obtained very satisfactory classification results.
Originality/value
The DDPML system is specially designed to smoothly handle Big Data mining classification.
Ilija Subasic, Nebojsa Gvozdenovic and Kris Jack
Abstract
Purpose
The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from a crowd-sourced data, demonstrate how to learn an optimal combination of distance metrics for duplicate detection and introduce a parallel duplicate clustering algorithm.
Design/methodology/approach
The authors developed the algorithm and compared it with state-of-the-art systems tackling the same problem. The authors used benchmark data sets (3k data points) to test the effectiveness of the algorithm and a real-life data set (>90 million data points) to test its efficiency and scalability.
Findings
The authors show that duplicate detection can be improved by an additional step they call duplicate clustering. The authors also show how to improve the efficiency of the map/reduce similarity calculation algorithm by introducing a sampling step. Finally, the authors find that the system is comparable to state-of-the-art systems for duplicate detection and that it can scale to deal with hundreds of millions of data points.
Research limitations/implications
Academic researchers can use this paper to understand some of the issues of transitivity in duplicate detection, and its effects on digital catalogue generations.
Practical implications
Industry practitioners can use this paper as a use case study for generating a large-scale real-life catalogue generation system that deals with millions of records in a scalable and efficient way.
Originality/value
In contrast to other similarity calculation algorithms developed for map/reduce (m/r) frameworks, the authors present a specific variant of similarity calculation that is optimized for duplicate detection of bibliographic records by extending a previously proposed e-algorithm based on inverted index creation. In addition, the authors are concerned with more than duplicate detection and investigate how to group detected duplicates. The authors develop distinct algorithms for duplicate detection and duplicate clustering and use the canopy clustering idea for multi-pass clustering. The work extends the current state of the art by including the duplicate clustering step and demonstrates new strategies for speeding up m/r similarity calculations.
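The inverted-index idea behind this kind of similarity calculation can be sketched as follows (a hypothetical minimal reconstruction, not the authors' e-algorithm; a posting-list cut-off on frequent tokens stands in for their sampling step). Records sharing a rare token become candidate duplicate pairs, so expensive pairwise comparison is only needed within those candidates:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, max_postings=3):
    """Build an inverted index from tokens to record ids, then emit
    candidate duplicate pairs only for tokens whose posting list is
    short; very frequent tokens are skipped because they are not
    discriminative and would blow up the number of comparisons."""
    index = defaultdict(set)
    for rid, text in records.items():
        for tok in set(text.lower().split()):
            index[tok].add(rid)
    pairs = set()
    for tok, postings in index.items():
        if len(postings) > max_postings:  # too common to be useful
            continue
        pairs.update(combinations(sorted(postings), 2))
    return pairs

records = {
    1: "Large scale duplicate detection of citations",
    2: "Large-scale duplicate detection of citation records",
    3: "Grey wolf optimization for antenna design",
}
print(candidate_pairs(records))  # records 1 and 2 share rare tokens
```

Each token's posting list maps naturally onto an m/r reduce key, which is why this blocking style fits map/reduce frameworks well.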
Alejandro Vera-Baquero, Ricardo Colomo Palacios, Vladimir Stantchev and Owen Molloy
Abstract
Purpose
This paper aims to present a solution that enables organizations to monitor and analyse the performance of their business processes by means of Big Data technology. Business process improvement can drastically influence the profit of corporations and helps them to remain viable. However, traditional Business Intelligence systems are not sufficient to meet today's business needs: they are normally business domain-specific and not sufficiently process-aware to support process improvement-type activities, especially in large and complex supply chains, which entail integrating, monitoring and analysing a vast amount of dispersed, unstructured event logs produced in a variety of heterogeneous environments. This paper tackles this variability by devising different Big-Data-based approaches that aim to gain visibility into process performance.
Design/methodology/approach
The authors present a cloud-based solution that leverages Big Data (BD) technology to provide essential insights into business process improvement. The proposed solution is aimed at measuring and improving overall business performance, especially in very large and complex cross-organisational business processes, where this type of visibility is hard to achieve across heterogeneous systems.
Findings
Three different BD approaches have been undertaken based on Hadoop and HBase. First, the authors introduced a map-reduce approach that is suitable for batch processing and offers very high scalability. Second, they described an alternative solution that integrates the proposed system with Impala; this approach brings significant improvements with respect to map-reduce, as it focuses on performing real-time queries over HBase. Finally, the use of secondary indexes has also been proposed, with the aim of enabling immediate access to event instances for correlation, at the cost of high storage duplication and synchronization issues. This approach has produced remarkable results in the two real functional environments presented in the paper.
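The secondary-index trade-off described above can be illustrated with a small in-memory sketch (plain Python standing in for HBase; all names are illustrative, not the paper's schema). Every write must update both the primary store and the index, which is exactly the duplication and synchronization cost the authors mention:

```python
from collections import defaultdict

class EventStore:
    """Primary store keyed by event id (as an HBase row key would be),
    plus a secondary index from a correlation key (the business process
    instance) back to event ids, kept in sync on every write."""
    def __init__(self):
        self.primary = {}                    # event_id -> event record
        self.by_process = defaultdict(set)   # process_id -> event_ids

    def put(self, event_id, process_id, payload):
        self.primary[event_id] = {"process": process_id, "payload": payload}
        self.by_process[process_id].add(event_id)   # keep index in sync

    def events_for(self, process_id):
        """Immediate access to all events of one business process,
        without scanning the whole primary store."""
        return [self.primary[e] for e in sorted(self.by_process[process_id])]

store = EventStore()
store.put("e1", "order-42", "created")
store.put("e2", "order-42", "shipped")
store.put("e3", "order-7", "created")
print([e["payload"] for e in store.events_for("order-42")])
```

The lookup cost of correlation drops from a full scan to a single index read, at the price of storing every event reference twice.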
Originality/value
The value of the contribution lies in the comparison and integration of software packages towards an integrated solution intended for adoption by industry. Apart from that, the authors illustrate the deployment of the architecture in two different settings.
Abstract
Introduction: The Internet has tremendously transformed the computer and networking world. Information reaches our fingertips and adds data to our repositories within seconds. Big data was initially defined by three Vs: data come with greater variety, in increasing volumes and at extra velocity. Big data is a collection of structured, unstructured and semi-structured data gathered from different sources and applications, and it has become the most powerful buzzword in almost all business sectors. The real success of any industry can be judged by how the big data is analysed, potential knowledge is discovered and productive business decisions are made. New technologies such as artificial intelligence and machine learning have added more efficiency to storing and analysing data. Big data analytics (BDA) is most valuable to companies focusing on gaining insight into customer behaviour, trends and patterns. This popularity of big data has inspired insurance companies to utilise big data in their core systems to advance financial operations, improve customer service, construct a personalised environment and take all possible measures to increase revenue and profits.
Purpose: This study aims to recognise what big data stands for in the insurance sector and how the application of BDA has opened the door for new and innovative changes in the insurance industry.
Methodology: This study describes the field of BDA in the insurance sector, discusses its benefits, outlines the tools, architectural framework and method, describes applications both in general and in specific cases, and briefly discusses the opportunities and challenges.
Findings: The study concludes that BDA in insurance is evolving into a promising field for providing insight from very large data sets and improving outcomes while reducing costs. Its potential is great; however, there remain challenges to overcome.
Laouni Djafri, Djamel Amar Bensaber and Reda Adjoudj
Abstract
Purpose
This paper aims to solve the problems of big data analytics for prediction including volume, veracity and velocity by improving the prediction result to an acceptable level and in the shortest possible time.
Design/methodology/approach
This paper is divided into two parts. The first aims to improve the prediction results. Here, two ideas are proposed: a double-pruning enhanced random forest algorithm, and the extraction of a shared learning base via the stratified random sampling method to obtain a learning base representative of all the original data. The second part proposes a distributed architecture supported by new technology solutions, which in turn works in a coherent and efficient way with the sampling strategy under the supervision of the Map-Reduce algorithm.
Findings
The representative learning base, obtained by integrating two learning bases, the partial base and the shared base, presents an excellent representation of the original data set and gives very good Big Data predictive analytics results. Furthermore, these results were supported by the improved random forest supervised learning method, which played a key role in this context.
Originality/value
All companies are concerned, especially those that hold large amounts of information and want to screen it to improve their customer knowledge and optimize their campaigns.
Abstract
Purpose
Expert profiling plays an important role in expert finding for collaborative innovation in research social networking platforms. Dynamic changes in scientific knowledge have posed significant challenges to expert profiling. Current approaches mostly rely on the knowledge of other experts, the contents of static web pages or user behaviour, and thus overlook the insights offered by big social data generated through crowdsourcing in research social networks and scientific data sources. In light of this deficiency, this research proposes a big data-based approach that harnesses the collective intelligence of the crowd in (research) social networking platforms and scientific databases for expert profiling.
Design/methodology/approach
A big data analytics approach using crowdsourcing is designed and developed for expert profiling. The proposed approach interconnects big data sources covering publication data, project data and data from social networks (i.e. posts, updates and endorsements collected through crowdsourcing). A large volume of structured data representing scientific knowledge is available in Web of Science, Scopus, CNKI and the ACM Digital Library; these are considered publication data in this research context. Project data are located in databases hosted by funding agencies. The authors follow the Map-Reduce strategy to extract real-time data from all these sources. Two main steps, features mining and profile consolidation (detailed in the manuscript), are followed to generate comprehensive user profiles. The major tasks in features mining are the processing of big data sources to extract representational profile features, entity-profile generation and social-profile generation through crowd-opinion mining. In profile consolidation, the two profiles, namely the entity-profile and the social-profile, are conflated.
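The profile-consolidation step can be sketched as a weighted conflation of the two profiles (a hypothetical illustration; the weights, topic scores and function name are assumptions, not the paper's actual method):

```python
def consolidate(entity_profile, social_profile, w_entity=0.6, w_social=0.4):
    """Conflate an entity-profile (mined from publications and projects)
    with a social-profile (mined from crowd endorsements) into a single
    weighted expertise profile over the union of their topics."""
    merged = {}
    for topic in set(entity_profile) | set(social_profile):
        merged[topic] = round(
            w_entity * entity_profile.get(topic, 0.0)
            + w_social * social_profile.get(topic, 0.0), 3)
    return merged

entity = {"data mining": 0.9, "ontology": 0.4}
social = {"data mining": 0.7, "crowdsourcing": 0.8}
print(consolidate(entity, social))
```

Topics present in only one profile still survive consolidation, just with a discounted score, which keeps crowd-endorsed expertise visible even when it has no publication evidence yet.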
Findings
(1) The integration of crowdsourcing techniques with big research data analytics has improved the graded relevance of the constructed profiles. (2) A system to construct experts' profiles based on the proposed methods has been incorporated into an operational system called ScholarMate (www.scholarmate.com).
Research limitations
One shortcoming is that the experiments were conducted using a sampling strategy. In the future, the authors will perform large-scale controlled experiments and field tests to validate and comprehensively evaluate the design artifacts.
Practical implications
The business implication of this research work is that the developed methods and the system can be applied to streamline human capital management in organizations.
Originality/value
The proposed approach interconnects the opinions of crowds on one's expertise with the corresponding expertise demonstrated in scientific knowledge bases to construct comprehensive profiles. This is a novel approach that alleviates problems associated with existing methods. The authors' team has developed an expert profiling system operational in ScholarMate (www.scholarmate.com), a professional research social network that connects people to research with the aim of "innovating smarter", launched in 2007.
Abstract
Purpose
This paper proposes a resilient distributed processing technique (RDPT), in which the mapper and reducer are simplified with Spark contexts to support distributed parallel query processing.
Design/methodology/approach
The proposed work is implemented in Pig Latin with Spark contexts to develop query processing in a distributed environment.
Findings
Query processing in Hadoop relies on distributed processing with the MapReduce model. MapReduce distributes the work across different nodes through the implementation of complex mappers and reducers, but its results are valid only up to a certain data size.
Originality/value
Pig supports the required parallel processing framework with the following constructs during query processing: FOREACH, FLATTEN and COGROUP.
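For readers unfamiliar with these Pig constructs, the following Python sketch mimics what COGROUP and FLATTEN do (an illustration of the operators' semantics only, not the paper's implementation; the toy relations are made up):

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Pig's COGROUP: group two relations on a shared key, keeping the
    matching rows from each relation side by side as two bags."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)

def flatten(bags):
    """Pig's FLATTEN: un-nest a collection of bags into a flat list."""
    return [row for bag in bags for row in bag]

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "item": "disk"}, {"id": 1, "item": "ram"}]

grouped = cogroup(users, orders, key="id")
print(flatten(grouped[1]))  # ann's user row followed by her two orders
```

Unlike a JOIN, COGROUP keeps keys with an empty bag on one side (here key 2 has no orders), which is why it composes well with FLATTEN for outer-join-style queries.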
Ramakrishna Guttula and Venkateswara Rao Nandanavanam
Abstract
Purpose
The microstrip patch antenna is widely used for communication purposes, particularly in military and civilian applications. Even though several techniques have achieved numerous advances in this field, some systems require additional improvements to meet the remaining challenges; in particular, application-specific improvement is still needed for optimally designing the microstrip patch antenna. The paper aims to discuss these issues.
Design/methodology/approach
This paper adopts an advanced meta-heuristic search algorithm called grey wolf optimization (GWO), inspired by the hunting behaviour of grey wolves, for the design of the patch antenna parameters. The search for the optimal antenna design is sped up using an opposition-based solution search. Moreover, the proposed model derives a nonlinear objective model to aid the design of the solution space of antenna parameters. After executing the simulation model, the paper compares the performance of the proposed GWO-based microstrip patch antenna with several conventional models.
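A minimal grey wolf optimization sketch is shown below, using a toy sphere objective in place of the paper's nonlinear antenna objective (this is a deliberately simplified, elitist variant of standard GWO; the opposition-based search step and the antenna model itself are omitted):

```python
import random

def gwo(objective, dim, bounds, n_wolves=12, iters=60, seed=0):
    """Minimal grey wolf optimization: the pack encircles the three best
    solutions (alpha, beta, delta); the control parameter a decays from
    2 to 0, shifting behaviour from exploration to exploitation."""
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=objective)
        alpha, beta, delta = wolves[0], wolves[1], wolves[2]
        a = 2 - 2 * t / iters  # exploration coefficient decays linearly
        for i in range(3, n_wolves):  # leaders are kept (elitist variant)
            new = []
            for d in range(dim):
                x = 0.0
                for leader in (alpha, beta, delta):
                    r1, r2 = rng.random(), rng.random()
                    A, C = a * (2 * r1 - 1), 2 * r2
                    x += leader[d] - A * abs(C * leader[d] - wolves[i][d])
                new.append(min(hi, max(lo, x / 3)))  # average, clamped
            wolves[i] = new
    return min(wolves, key=objective)

# Toy stand-in for the antenna objective: a sphere function whose
# minimum plays the role of the optimal design parameter vector.
best = gwo(lambda w: sum(x * x for x in w), dim=3, bounds=(-5.0, 5.0))
print(best)
```

On this convex toy objective the pack typically contracts towards the origin; a real antenna design would replace the lambda with the simulated gain/return-loss objective.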
Findings
The gain of the proposed model is 27.05 per cent better than the WOAD model, 2.07 per cent better than AAD, 15.80 per cent better than GAD, 17.49 per cent better than PSAD and 3.77 per cent better than GWAD. Thus, the proposed antenna model attains high gain, leading to superior performance.
Originality/value
This paper presents a technique for designing the microstrip patch antenna using the proposed GWO algorithm. This is the first work that utilizes GWO-based optimization for microstrip patch antenna design.
Cong-Phuoc Phan, Hong-Quang Nguyen and Tan-Tai Nguyen
Abstract
Purpose
Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industry. To maximally exploit their potential, searching these patent documents has increasingly become an important topic. Although much research has processed large collections, few studies have attempted to integrate both patent classifications and specifications for analyzing user queries. Consequently, queries are often insufficiently analyzed for improving the accuracy of search results. This paper aims to address this limitation by exploiting the semantic relationships between patent contents and their classification.
Design/methodology/approach
The contributions are fourfold. First, the authors enhance the similarity measurement between two short sentences, making it 20 per cent more accurate. Second, the Graph-embedded Tree ontology is enriched by integrating both the patent documents and the classification scheme. Third, the ontology does not rely on a rule-based method or text matching; instead, a heuristic meaning comparison is applied to extract semantic relationships between concepts. Finally, the patent search approach uses the ontology effectively, with the results sorted based on their most common order.
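As a simple baseline for the kind of short-sentence similarity being improved here, a token-overlap (Jaccard) measure can be written as follows (illustrative only; the paper's enhanced measure is not specified in the abstract, and the example sentences are made up):

```python
def jaccard_similarity(s1, s2):
    """Token-overlap (Jaccard) similarity between two short sentences:
    |intersection| / |union| of their lower-cased word sets."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

q = "method for freight container tracking"
p = "container tracking method for freight logistics"
print(round(jaccard_similarity(q, p), 3))
```

A purely lexical measure like this treats synonyms as disjoint tokens, which is exactly the weakness that semantic, ontology-backed comparison targets.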
Findings
An experiment searching 600 patent documents in the field of logistics yields a 15 per cent improvement in F-measure when compared with traditional approaches.
Research limitations/implications
The research, however, still requires improvement: the terms and phrases extracted as nouns and noun phrases make less sense in some respects and thus might not result in high accuracy. The large collection of extracted relationships could be further optimized for conciseness. In addition, parallel processing such as Map-Reduce could be used to improve search processing performance.
Practical implications
Scientists and technologists can use the experimental results to search for novel, non-obvious technologies in patents.
Social implications
High-quality patent search results will reduce patent infringement.
Originality/value
The proposed ontology is semantically enriched by integrating both patent documents and their classification. This ontology facilitates the analysis of the user queries for enhancing the accuracy of the patent search results.
Jianpeng Zhang and Mingwei Lin
Abstract
Purpose
The purpose of this paper is to make an overview of 6,618 publications of Apache Hadoop from 2008 to 2020 in order to provide a conclusive and comprehensive analysis for researchers in this field, as well as a preliminary knowledge of Apache Hadoop for interested researchers.
Design/methodology/approach
This paper employs bibliometric and visual analysis approaches to systematically study publications about Apache Hadoop in the Web of Science database, with the aid of visualization applications. Through bibliometric analysis of the collected documents, the paper analyzes the main statistical characteristics and cooperation networks. Research themes, research hotspots and future development trends are also investigated through keyword analysis.
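Keyword analysis of this kind typically rests on keyword co-occurrence counting, which can be sketched as follows (a generic illustration with made-up keyword lists, not the paper's dataset or tooling):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(papers):
    """Count how often each pair of keywords appears together across
    papers; frequently co-occurring pairs indicate research hotspots
    and form the edges of a keyword network."""
    counts = Counter()
    for kws in papers:
        counts.update(combinations(sorted(set(kws)), 2))
    return counts

papers = [
    ["hadoop", "mapreduce", "big data"],
    ["hadoop", "big data", "hdfs"],
    ["spark", "big data"],
]
print(cooccurrence(papers).most_common(1))
```

Visualization tools for bibliometrics then render these pair counts as a weighted graph, with cluster structure revealing the research themes.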
Findings
Research on Apache Hadoop will remain a top priority in the future, and how to improve the performance of Apache Hadoop in the era of big data is one of the research hotspots.
Research limitations/implications
This paper makes a comprehensive analysis of Apache Hadoop with bibliometric methods, and it is valuable for researchers who want to quickly grasp the hot topics in this area.
Originality/value
This paper draws the structural characteristics of the publications in this field and summarizes the research hotspots and trends in this field in recent years, aiming to understand the development status and trends in this field and inspire new ideas for researchers.