Search results
Abstract
Purpose
This paper proposes a resilient distributed processing technique (RDPT), in which the mapper and reducer are simplified with Spark contexts to support distributed parallel query processing.
Design/methodology/approach
The proposed work is implemented with Pig Latin with Spark contexts to develop query processing in a distributed environment.
Findings
Query processing in Hadoop relies on distributed processing with the MapReduce model. MapReduce distributes work across nodes through the implementation of complex mappers and reducers, but its results remain valid only up to a certain data size.
Originality/value
Pig provides the required parallel processing framework through constructs such as FOREACH, FLATTEN and COGROUP during query processing.
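The semantics of these operators can be illustrated with a small Python sketch; the relation names and data below are hypothetical, and a real Pig script would execute these constructs on a cluster rather than in local memory:

```python
from collections import defaultdict

def cogroup(left, right):
    """COGROUP: for each key, collect the bag of values from each relation."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return dict(groups)

def flatten(grouped):
    """FLATTEN: expand each (key, (bag1, bag2)) into flat tuples."""
    for k, (bag1, bag2) in grouped.items():
        for a in bag1:
            for b in bag2:
                yield (k, a, b)

orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]
users = [("u1", "Ann"), ("u2", "Bob")]

g = cogroup(orders, users)       # {'u1': (['book', 'lamp'], ['Ann']), ...}
flat = sorted(flatten(g))        # joined, un-nested tuples
```

A FOREACH would then map a function over each tuple of `flat`, which is exactly a per-record projection step.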
Alasdair J.G. Gray, Werner Nutt and M. Howard Williams
Abstract
Purpose
Distributed data streams are an important topic of current research. In such a setting, data values will be missed, e.g. due to network errors. This paper aims to allow this incompleteness to be detected and overcome with either the user not being affected or the effects of the incompleteness being reported to the user.
Design/methodology/approach
A model for representing the incomplete information has been developed that captures the information that is known about the missing data. Techniques for query answering involving certain and possible answer sets have been extended so that queries over incomplete data stream histories can be answered.
Findings
It is possible to detect when a distributed data stream is missing one or more values. When such data values are missing there will be some information that is known about the data and this is stored in an appropriate format. Even when the available data are incomplete, it is possible in some circumstances to answer a query completely. When this is not possible, additional meta‐data can be returned to inform the user of the effects of the incompleteness.
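Under the assumption that the stored information about a missing value takes the form of known bounds, the certain/possible distinction can be sketched in Python; the representation and the threshold query below are illustrative, not the paper's actual model:

```python
# A missing stream value is stored with known bounds (lo, hi).
# A predicate holds *certainly* if true in every completion of the data,
# and *possibly* if true in at least one completion.

def classify(value, lo, hi, threshold):
    """Classify the answer status of 'reading > threshold'."""
    if value is not None:              # value was observed
        return "certain" if value > threshold else "no"
    if lo > threshold:                 # every possible completion qualifies
        return "certain"
    if hi > threshold:                 # some completion qualifies
        return "possible"
    return "no"

history = [
    (12.0, None, None),   # observed value
    (None, 11.0, 15.0),   # missing, but bounds guarantee the answer
    (None, 5.0, 20.0),    # missing, answer only possible
    (None, 2.0, 9.0),     # missing, cannot qualify
]
answers = [classify(v, lo, hi, 10.0) for v, lo, hi in history]
```

Rows classified "possible" are where the meta-data about incompleteness would be reported back to the user.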
Research limitations/implications
The techniques and models proposed in this paper have only been partially implemented.
Practical implications
The proposed system is general and can be applied wherever there is a need to query the history of distributed data streams. The work in this paper enables the system to answer queries when there are missing values in the data.
Originality/value
This paper presents a general model of how to detect, represent, and answer historical queries over incomplete distributed data streams.
Jinbao Li, Yingshu Li, My T. Thai and Jianzhong Li
Abstract
This paper investigates query processing in MANETs, studying cache techniques and multi‐join database operations. For data caching, a group‐caching strategy is proposed: using the cache and the index of the cached data, queries can be processed at a single node or within the group containing that node. For multi‐join, a cost evaluation model and a query plan generation algorithm are presented. Query cost is evaluated from parameters including the size of the transmitted data, the transmission distance and the query cost at each single node. Based on these evaluations, the nodes on which the query should be executed and the join order are determined. Theoretical analysis and experimental results show that the proposed group‐caching based query processing and the cost‐based join strategy are efficient in MANETs, suiting the mobility, disconnection and multi‐hop features of such networks. The communication cost between nodes is reduced and query efficiency is greatly improved.
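A cost model of this general shape can be sketched in Python: enumerate candidate execution orders and pick the one minimizing transmission plus local cost. The relations, sizes, distances and the cost formula below are invented for illustration and are not the paper's actual model:

```python
from itertools import permutations

# Hypothetical per-node statistics: relation size (KB) and local query cost.
size = {"A": 40, "B": 10, "C": 25}
local = {"A": 3, "B": 1, "C": 2}
dist = {("A", "B"): 2, ("B", "A"): 2, ("B", "C"): 1,
        ("C", "B"): 1, ("A", "C"): 3, ("C", "A"): 3}

def plan_cost(order):
    """Cost = transmitted size x hop distance, plus local cost at each node."""
    cost, carried = 0, 0
    for src, dst in zip(order, order[1:]):
        carried += size[src]                      # intermediate result grows
        cost += carried * dist[(src, dst)] + local[dst]
    return cost

best = min(permutations(size), key=plan_cost)     # cheapest join order
```

Here the cheapest plan starts at the node whose relation is cheap to ship over short hops, which is the intuition behind cost-based join ordering in a bandwidth-constrained MANET.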
Usha Manasi Mohapatra, Babita Majhi and Alok Kumar Jagadev
Abstract
Purpose
The purpose of this paper is to propose three different distributed learning-based metaheuristic algorithms for the identification of nonlinear systems. The proposed algorithms are evaluated on problems for which input data are available at different geographic locations. In addition, the models are tested for nonlinear systems under different noise conditions. In a nutshell, the suggested model aims to handle voluminous data with low communication overhead compared to traditional centralized processing methodologies.
Design/methodology/approach
Population-based evolutionary algorithms such as genetic algorithm (GA), particle swarm optimization (PSO) and cat swarm optimization (CSO) are implemented in a distributed form to address the system identification problem having distributed input data. Out of different distributed approaches mentioned in the literature, the study has considered incremental and diffusion strategies.
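The incremental strategy can be sketched for a toy linear system: a parameter estimate circulates around a ring of nodes, and each node refines it with its local samples before passing it on. The system, the node data and the LMS-style update below are illustrative assumptions, not the paper's GA/PSO/CSO algorithms:

```python
# Incremental distributed learning sketch: the estimate w travels node to
# node in a ring; each node applies a local correction from its own data.
# True system (w = 2.0), node samples and step size mu are illustrative.

true_w = 2.0
node_data = [
    [(1.0, 2.0), (2.0, 4.0)],     # node 1: (input, output) samples
    [(0.5, 1.0), (1.5, 3.0)],     # node 2
    [(3.0, 6.0), (1.0, 2.0)],     # node 3
]

w, mu = 0.0, 0.1
for _cycle in range(50):          # estimate circulates around the ring
    for samples in node_data:     # one node at a time: low communication
        for x, y in samples:
            w += mu * x * (y - w * x)   # local gradient-style correction
```

Only the scalar estimate crosses node boundaries, never the raw samples, which is the communication-overhead advantage the abstract claims over centralized processing. A diffusion strategy would instead have each node combine the estimates of all its neighbours before its local update.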
Findings
Performances of the proposed distributed learning-based algorithms are compared for different noise conditions. The experimental results indicate that CSO performs better than GA and PSO at all noise strengths with respect to accuracy and error convergence rate; moreover, incremental CSO is slightly superior to diffusion CSO.
Originality/value
This paper employs evolutionary algorithms using distributed learning strategies and applies them to the identification of unknown systems. Very few existing studies have applied these distributed learning strategies to the parameter estimation task.
Abstract
This review reports on the current state and the potential of tools and systems designed to aid online searching, referred to here as online searching aids. Intermediary mechanisms are examined in terms of the two stage model, i.e. end‐user, intermediary, ‘raw database’, and different forms of user-system interaction are discussed. The evolution of the terminology of online searching aids is presented with special emphasis on the expert/non‐expert division. Terms defined include gateways, front‐end systems, intermediary systems and post‐processing. The alternative configurations that such systems can have and the approaches to the design of the user interface are discussed. The review then analyses the functions of online searching aids, i.e. logon procedures, access to hosts, help features, search formulation, query reformulation, database selection, uploading, downloading and post‐processing. Costs are then briefly examined. The review concludes by looking at future trends following recent developments in computer science and elsewhere. Distributed expert based information systems (debis), the standard generalised mark‐up language (SGML), the client‐server model, object‐orientation and parallel processing are expected to influence, if they have not done so already, the design and implementation of future online searching aids.
Gerti Kappel and Stefan Vieweg
Abstract
Changes in market and production profiles require a more flexible concept in manufacturing. Computer integrated manufacturing (CIM) describes an integrative concept for joining business and manufacturing islands. In this context, database technology is the key technology for implementing the CIM philosophy. However, CIM applications are more complex and thus more demanding than traditional database applications such as business and administrative applications. Systematically analyses the database requirements for CIM applications including business and manufacturing tasks. Special emphasis is given to integration requirements due to the distributed, partly isolated nature of CIM applications developed over the years. An illustrative sampling of current efforts in the database community to meet the challenge of non‐standard applications such as CIM is presented.
Michael J. Frasciello and John Richardson
Abstract
Library consortia require automation systems that adequately address the following questions: Can the system support centralized and decentralized server configurations? Does the software’s architecture accommodate changing requirements? Does the system provide seamless behavior? Contends that the evolution of distributed enterprise computing technology has brought the library automation industry to a new realization that automation systems engineered with an n‐tiered client/server architecture will best meet the needs of library consortia. Standards‐based distributed processing is the key to the n‐tier client/server paradigm. While some technologies (i.e. UNIX) provide for a single standard on which to define distributed processing, only Microsoft’s Windows NT supports multiple standards. From Microsoft’s perspective, the Windows NT operating system is the middle tier of the n‐tier client/server environment. To truly exploit the middle tier, an application must utilize Microsoft Transaction Server (MTS). Native Windows NT automation systems utilizing MTS are best positioned for the future because MTS assumes an n‐tier architecture with the middle tier (or tiers) deployed on Windows NT Server. “Native” NT applications are built in and for Microsoft Windows NT. Library consortia considering a native Windows NT automation system should evaluate the system’s distributed processing capabilities to determine its applicability to their needs. Library consortia can test a vendor’s claim to scalable distributed processing by asking three questions: Is the software dependent on the type of data being used? Does the software support logical and physical separation (distribution)? Does the software require a system shutdown to perform database or application updates?
Abstract
The computer systems developed during the 1960s and 1970s made very little impact on management decision. Management Information System design was constrained by three factors — the technology was large‐scale and inevitably centralised and controlled by data processing staff; the systems were designed by specialist staff who rarely understood the business requirements; and managers themselves had little knowledge or “hands‐on” experience of computers. In the 1980s a greater awareness of the need for planning and better use of personnel information, coupled with the development of distributed processing systems, has presented personnel management with opportunities to use computing technology as a means of increasing the professionalism of practising personnel managers. Effective use will only occur if the implementation of technology is matched by appraisal of skills and organisation within personnel departments. Staff will need a minimum level of computing expertise and some managers will need skills in modelling, particularly financial modelling. The relationship between personnel and data processing needs careful redefining to build a link between the two and data processing staff need to design and communicate an end‐user strategy.
Abstract
Purpose
Big data has posed problems for businesses, the information technology (IT) sector and the science community. These problems can be effectively addressed using cloud computing and associated distributed computing technology. Cloud computing and big data are two significant developments of recent years that allow high-efficiency, competitive computing tools to be delivered as IT services. The paper aims to examine the role of the cloud as a tool for managing big data in various aspects to help businesses.
Design/methodology/approach
This paper delivers solutions in the cloud for storing, compressing, analyzing and processing big data. Articles were therefore divided into four categories: big data storage, big data processing, big data analysis and data compression in cloud computing. The article is based on a systematic literature review of 19 published papers on big data.
Findings
From the results, it can be inferred that cloud computing technology has features that can be useful for big data management. Challenging issues are raised in each section. For example, in storing big data, privacy and security issues are challenging.
Research limitations/implications
There were limitations to this systematic review. The first limitation is that only English articles were reviewed. Also, articles that matched the keywords were used. Finally, in this review, authoritative articles were reviewed, and slides and tutorials were avoided.
Practical implications
The research presents new insight into the business value of cloud computing in interfirm collaborations.
Originality/value
Previous research has often examined other aspects of big data in the cloud. This article takes a new approach to the subject. It allows big data researchers to comprehend the various aspects of big data management in the cloud. In addition, setting an agenda for future research saves time and effort for readers searching for topics within big data.
Bao-Rong Chang, Hsiu-Fen Tsai, Yun-Che Tsai, Chin-Fu Kuo and Chi-Chung Chen
Abstract
Purpose
The purpose of this paper is to integrate and optimize a multiple big data processing platform with the features of high performance, high availability and high scalability in big data environment.
Design/methodology/approach
First, the integration of Apache Hive, Cloudera Impala and BDAS Shark makes the platform support SQL-like queries. Next, users access a single interface, and the proposed optimizer automatically selects the best-performing big data warehouse platform. Finally, the distributed memory storage system Memcached, incorporated with the distributed file system Apache HDFS, is employed for fast caching of query results. Therefore, if users issue the same SQL command, the result is returned rapidly from the cache system instead of incurring a repeated search of the big data warehouse and a longer retrieval time.
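The caching step can be sketched as follows; the class, the key scheme and the simulated engine are hypothetical stand-ins for the actual Memcached/HDFS integration:

```python
import hashlib

class QueryCache:
    """Sketch of the caching layer: an identical SQL command returns a
    cached result instead of re-running against the warehouse."""

    def __init__(self, engine):
        self.engine = engine          # function: sql -> result (slow path)
        self.store = {}               # stands in for Memcached
        self.hits = 0

    def query(self, sql):
        # Normalize the command so trivially different spellings share a key.
        key = hashlib.md5(sql.strip().lower().encode()).hexdigest()
        if key in self.store:
            self.hits += 1            # fast path: cached result
            return self.store[key]
        result = self.engine(sql)     # slow path: real warehouse scan
        self.store[key] = result
        return result

calls = []
def fake_engine(sql):                 # simulated warehouse for the sketch
    calls.append(sql)
    return [("row", 1)]

cache = QueryCache(fake_engine)
cache.query("SELECT * FROM t")
cache.query("select * from t")        # same command: served from cache
```

The second call never reaches the engine, which is the effect the abstract describes for high-repeatable SQL commands under multi-user mode.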
Findings
As a result, the proposed approach significantly improves overall performance and dramatically reduces search time when querying a database, especially for highly repeatable SQL commands under multi-user mode.
Research limitations/implications
Currently, Shark’s latest stable version 0.9.1 does not support the latest versions of Spark and Hive. In addition, this series of software only supports Oracle JDK7. Using Oracle JDK8 or Open JDK will cause serious errors, and some software will be unable to run.
Practical implications
The problem with this system is that some blocks are missing when too many blocks are stored in one result (about 100,000 records). Another problem is that sequential writing into the in-memory cache wastes time.
Originality/value
When the remaining memory capacity is 2 GB or less on each server, Impala and Shark will have a lot of page swapping, causing extremely low performance. When the data scale is larger, it may cause a JVM I/O exception and make the program crash. However, when the remaining memory capacity is sufficient, Shark is faster than Hive and Impala. Impala’s consumption of memory resources is between those of Shark and Hive, and this amount of remaining memory is sufficient for Impala’s maximum performance. In this study, each server allocates 20 GB of memory for cluster computing and sets the amount of remaining memory as Level 1: 3 percent (0.6 GB), Level 2: 15 percent (3 GB) and Level 3: 75 percent (15 GB) as the critical points. The program automatically selects Hive when memory is less than 15 percent, Impala at 15 to 75 percent and Shark at more than 75 percent.
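The threshold-based selection described above can be expressed directly; the engine names and cut-offs come from the abstract, while the function name and boundary handling are assumptions:

```python
def select_engine(remaining_fraction):
    """Pick the query engine from the fraction of memory still free,
    following the thresholds reported in the study."""
    if remaining_fraction < 0.15:
        return "Hive"        # low memory: disk-based engine avoids swapping
    if remaining_fraction <= 0.75:
        return "Impala"      # moderate memory consumption
    return "Shark"           # ample memory: fastest in-memory engine

# e.g. with 0.6 GB of 20 GB free (3 percent), fall back to Hive
engine = select_engine(0.6 / 20)
```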