Search results
Bao-Rong Chang, Hsiu-Fen Tsai, Yun-Che Tsai, Chin-Fu Kuo and Chi-Chung Chen
Abstract
Purpose
The purpose of this paper is to integrate and optimize a multiple big data processing platform with the features of high performance, high availability and high scalability in big data environment.
Design/methodology/approach
First, the integration of Apache Hive, Cloudera Impala and BDAS Shark makes the platform support SQL-like queries. Next, users access a single interface, and the proposed optimizer automatically selects the best-performing big data warehouse platform. Finally, the distributed memory storage system Memcached, incorporated into the distributed file system Apache HDFS, is employed to cache query results. Therefore, if a user issues the same SQL command again, the result is returned rapidly from the cache system instead of repeating the search in the big data warehouse and taking a longer time to retrieve.
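The cache-aside pattern described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a plain Python dict stands in for Memcached over HDFS, and a caller-supplied function stands in for the Hive/Impala/Shark warehouse query.

```python
import hashlib

class CachedQueryEngine:
    """Serve repeated SQL commands from a cache; fall back to the warehouse."""

    def __init__(self, warehouse_query):
        self._query = warehouse_query  # stand-in for a Hive/Impala/Shark call
        self._cache = {}               # stand-in for Memcached backed by HDFS

    def execute(self, sql):
        # Hash the SQL text to form a fixed-size cache key.
        key = hashlib.sha1(sql.encode()).hexdigest()
        if key in self._cache:
            return self._cache[key], "cache"
        result = self._query(sql)
        self._cache[key] = result      # store for subsequent identical queries
        return result, "warehouse"
```

On a repeated query the second call never reaches the warehouse, which is the source of the speed-up the abstract reports for highly repeated SQL commands.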
Findings
As a result, the proposed approach significantly improves overall performance and dramatically reduces search time when querying a database, especially for highly repeated SQL commands in multi-user mode.
Research limitations/implications
Currently, Shark’s latest stable version, 0.9.1, does not support the latest versions of Spark and Hive. In addition, this series of software only supports Oracle JDK7; using Oracle JDK8 or OpenJDK causes serious errors, and some of the software will not run.
Practical implications
One problem with this system is that some blocks are missing when too many blocks are stored in one result (about 100,000 records). Another is that sequential writing into the in-memory cache wastes time.
Originality/value
When the remaining memory capacity is 2 GB or less on each server, Impala and Shark incur heavy page swapping, causing extremely low performance. At larger data scales, this may cause JVM I/O exceptions and crash the program. However, when the remaining memory capacity is sufficient, Shark is faster than Hive and Impala. Impala’s consumption of memory resources lies between those of Shark and Hive, and a moderate amount of remaining memory is sufficient for Impala’s maximum performance. In this study, each server allocates 20 GB of memory for cluster computing, and the amount of remaining memory is set at Level 1: 3 percent (0.6 GB), Level 2: 15 percent (3 GB) and Level 3: 75 percent (15 GB) as the critical points. The program automatically selects Hive when remaining memory is less than 15 percent, Impala at 15 to 75 percent and Shark at more than 75 percent.
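The selection rule at the end of the abstract reduces to a simple threshold function. The sketch below encodes the stated 15 percent and 75 percent cut-offs; the function name and signature are illustrative, not from the paper.

```python
def select_engine(remaining_pct):
    """Choose the SQL engine from the remaining-memory percentage,
    using the paper's thresholds (15% and 75%)."""
    if remaining_pct < 15:
        return "Hive"    # disk-oriented; safest under memory pressure
    if remaining_pct <= 75:
        return "Impala"  # moderate memory footprint
    return "Shark"       # in-memory; fastest when memory is plentiful
```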
Rabab Hayek, Guillaume Raschia, Patrick Valduriez and Noureddine Mouaddib
Abstract
Purpose
The goal of this paper is to contribute to the development of both data localization and description techniques in P2P systems.
Design/methodology/approach
The approach consists of introducing a novel indexing technique that relies on linguistic data summarization into the context of P2P systems.
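The routing idea can be illustrated with a toy sketch. This is an assumption-laden simplification, not the paper's technique: a peer's data is "summarized" as the set of coarse categories it contains (standing in for a linguistic summary such as age being "young" or "old"), and a query is forwarded only to peers whose summary may match.

```python
def categorize(age):
    """Toy linguistic label for a numeric value (assumed threshold)."""
    return "young" if age < 40 else "old"

def build_summary(rows, attr):
    """Summarize a peer's data as the set of labels present for one attribute."""
    return {categorize(row[attr]) for row in rows}

def route(query_label, peer_summaries):
    """Forward the query only to peers whose summary covers the label,
    pruning peers that cannot contribute an answer."""
    return [peer for peer, summary in peer_summaries.items()
            if query_label in summary]
```

Pruning peers this way is what reduces the query-routing cost the Findings section mentions.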
Findings
The cost model of the approach, as well as the simulation results have shown that the approach allows the efficient maintenance of data summaries, without incurring high traffic overhead. In addition, the cost of query routing is significantly reduced in the context of summaries.
Research limitations/implications
The paper considered a summary service defined on the APPA architecture. Future work should study extending this approach so that it is generally applicable to any P2P data management system.
Practical implications
This paper has mainly studied the quantitative gain that can be obtained in query processing by exploiting data summaries. Future work aims to apply this technique to real (not synthetic) data in order to study the qualitative gain that can be obtained from approximately answering a query.
Originality/value
The novelty of the approach shown in the paper lies in the double exploitation of summaries in P2P systems: data summaries allow for semantic-based query routing and also for approximate query answering, using their intensional descriptions.
Kento Goto, Misato Kotani and Motomichi Toyama
Abstract
Purpose
Currently, the results of database queries are presented in various ways, but users’ understanding could be improved by presenting some search results, such as product images on shopping sites, in three dimensions rather than two. Therefore, this paper aims to propose a system for automatically generating a 3D virtual museum that arranges 3D objects in various layouts from the results of relational database queries written in SuperSQL.
Design/methodology/approach
The study extended SuperSQL to generate a 3D virtual-reality museum using declarative queries over relational data stored in a database.
Findings
This system made it possible to generate various three-dimensional virtual spaces with different layouts through simple queries.
Originality/value
This system is useful in that a complicated three-dimensional virtual space can be generated by writing a simple query, and a different three-dimensional virtual space can be generated by slightly changing the query or the database content. When creating a virtual museum, the burden on the user is high if there are many exhibits or the layout changes. With this system, however, various virtual museums can be generated automatically and easily, reducing the burden on users.
Aymen Gammoudi, Allel Hadjali and Boutheina Ben Yaghlane
Abstract
Purpose
Time modeling is a crucial feature in many application domains. However, temporal information often is not crisp, but is subjective and fuzzy. The purpose of this paper is to address the issue related to the modeling and handling of imperfection inherent to both temporal relations and intervals.
Design/methodology/approach
On the one hand, fuzzy extensions of Allen temporal relations are investigated and, on the other hand, extended temporal relations to define the positions of two fuzzy time intervals are introduced. Then, a database system, called Fuzzy Temporal Information Management and Exploitation (Fuzz-TIME), is developed for the purpose of processing fuzzy temporal queries.
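One way to fuzzify an Allen relation, sketched below under assumptions of our own (the paper's actual formalization may differ), is to replace the crisp "A before B" test with a membership degree that rises linearly over a tolerance zone around the boundary.

```python
def fuzzy_before(a_end, b_start, tolerance):
    """Degree to which interval A is 'before' interval B.

    Returns 1.0 when A ends at least `tolerance` time units before B
    starts, 0.0 when A does not end before B starts at all, and a
    linearly interpolated degree in between -- a simple fuzzification
    of Allen's crisp 'before' relation.
    """
    gap = b_start - a_end
    if gap >= tolerance:
        return 1.0
    if gap <= 0:
        return 0.0
    return gap / tolerance
```

A query such as "events shortly before the siege" would then rank results by this degree instead of filtering with a hard boundary.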
Findings
To evaluate the proposal, the authors have implemented a Fuzz-TIME system and created a fuzzy historical database for the querying purpose. Some demonstrative scenarios from history domain are proposed and discussed.
Research limitations/implications
The authors have conducted some experiments on archaeological data to show the effectiveness of the Fuzz-TIME system. However, thorough experiments on large-scale databases are highly desirable to show the behavior of the tool with respect to the performance and time execution criteria.
Practical implications
The tool developed (Fuzz-TIME) has many practical applications where time information must be dealt with, particularly in real-world domains such as history, medicine, criminal investigation and finance, where time is often perceived or expressed in an imprecise or fuzzy manner.
Social implications
The social implications of this work can be expected in two domains in particular: in museums, to manage, exploit and analyse information related to archives and historical data; and in hospitals and medical organizations, to deal with temporal information inherent to data about patients and diseases.
Originality/value
This paper presents the design and characterization of a novel and intelligent database system to process and manage the imperfection inherent to both temporal relations and intervals.
Wei Xing, Marios D. Dikaiakos, Hua Yang, Angelos Sphyris and George Eftichidis
Abstract
Purpose
This paper aims to describe the main challenges of identifying and accessing useful information and knowledge about natural hazards and disasters results. The paper presents a grid‐based digital library system designed to address the challenges.
Design/methodology/approach
The need to organize and publish metadata about European research results in the field of natural disasters has been met with the help of two innovative technologies: the Open Grid Service Architecture (OGSA) and the Resource Description Framework (RDF). OGSA provides a common platform for sharing distributed metadata securely. RDF facilitates the creation and exchange of metadata.
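RDF represents metadata as subject–predicate–object triples that can be matched against patterns. The toy triple store below illustrates the idea only; it is not the paper's system, and the example subjects and predicates are invented for illustration.

```python
triples = set()

def add(subject, predicate, obj):
    """Record one RDF-style metadata statement."""
    triples.add((subject, predicate, obj))

def query(subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard,
    as in an RDF triple-pattern query."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]
```

Distributing such triples over a grid of nodes, each answering pattern queries over its own store, is the shape of the distributed-query capability the Findings describe.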
Findings
Using grid technology allows the RDF metadata of European research results in the field of natural disasters to be shared securely and effectively in a heterogeneous network environment.
Originality/value
A metadata approach is proposed whereby metadata can be extracted, distributed to third parties in batch, and shared with other applications quickly. Furthermore, a method is set out to describe metadata in a common and open format, which can become a widely accepted standard; a common standard enables metadata storage on different platforms while supporting distributed queries across different metadata databases, the integration of metadata extracted from different sources, and so on. It can also be used by general-purpose search engines.
Abstract
Database management systems (DBMS) and information retrieval (IR) systems can both be used as online information systems but they differ in the type of data and the types of retrieval they provide for users. Many previous attempts have been made to couple DBMS and IR systems together, either by integrating the two into a unified framework, or by using a DBMS as an implementation tool for information retrieval functionality. This paper reports on some of these previous attempts and describes a system, retriev, which uses a DBMS to implement an IR system for teaching and research purposes. The implementation of retriev is described in detail and the effects that the current trends in database research will have on the relationship between DBMS and IR systems, are discussed.
Abstract
Web resource usage statistics enable server owners to monitor how their users use their Web sites. However, such statistics are only compiled for individual servers. If resource usage was monitored across the whole Web, the changing interests of society would be revealed, and deep insights made into the changing nature of the Web. However, capturing the information required for such a service, and providing acceptable system performance, presents significant challenges. As such, we have developed a model, called WebRUM, which offers a scalable system‐wide solution through the extension of a resource migration mechanism that we have previously designed. The paper describes the mechanism, and shows how it can be extended to monitor Web‐wide resource usage. The information stored by the model is defined, and the performance of a prototype mechanism is presented to demonstrate the effectiveness of the design.
Alejandro Vera-Baquero, Ricardo Colomo Palacios, Vladimir Stantchev and Owen Molloy
Abstract
Purpose
This paper aims to present a solution that enables organizations to monitor and analyse the performance of their business processes by means of Big Data technology. Business process improvement can drastically influence the profit of corporations and helps them remain viable. However, traditional Business Intelligence systems are not sufficient to meet today's business needs. They are normally business-domain-specific and not sufficiently process-aware to support process-improvement activities, especially in large and complex supply chains, where improvement entails integrating, monitoring and analysing a vast amount of dispersed, unstructured event logs produced in a variety of heterogeneous environments. This paper tackles this variability by devising different Big-Data-based approaches that aim to gain visibility into process performance.
Design/methodology/approach
The authors present a cloud-based solution that leverages Big Data technology to provide essential insights into business process improvement. The proposed solution is aimed at measuring and improving overall business performance, especially in very large and complex cross-organisational business processes, where this type of visibility is hard to achieve across heterogeneous systems.
Findings
Three different Big Data approaches have been undertaken, based on Hadoop and HBase. First, a map-reduce approach is introduced that is suitable for batch processing and offers very high scalability. Second, an alternative solution is described that integrates the proposed system with Impala; this improves significantly on map-reduce, as it focuses on performing real-time queries over HBase. Finally, the use of secondary indexes is proposed, with the aim of enabling immediate access to event instances for correlation, at the cost of high storage duplication and synchronization issues. This approach has produced remarkable results in the two real functional environments presented in the paper.
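The secondary-index trade-off can be sketched as follows. This is an illustrative in-memory analogue, not the paper's HBase schema: alongside the primary store keyed by event id, a second structure keyed by correlation id is maintained, duplicating keys in exchange for immediate correlation lookups.

```python
from collections import defaultdict

class EventStore:
    """Primary store keyed by event id, plus a secondary index on
    correlation id for immediate lookup (at the cost of duplication)."""

    def __init__(self):
        self.rows = {}                    # event_id -> event record
        self.by_corr = defaultdict(list)  # correlation_id -> [event_id]

    def put(self, event_id, corr_id, payload):
        self.rows[event_id] = {"corr": corr_id, "payload": payload}
        # The duplicated key below is the storage/synchronization cost
        # traded for fast correlation.
        self.by_corr[corr_id].append(event_id)

    def correlate(self, corr_id):
        """All events for one correlation id, without scanning the store."""
        return [self.rows[eid] for eid in self.by_corr[corr_id]]
```

Without the index, `correlate` would have to scan every row, which is why the batch map-reduce approach suits offline analysis while the indexed approach suits immediate correlation.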
Originality/value
The value of the contribution lies in the comparison and integration of software packages into an integrated solution intended for adoption by industry. In addition, the authors illustrate the deployment of the architecture in two different settings.
Abstract
Purpose
These projects aim to improve future library services by combining Linked Open Data (LOD) technology with data visualization, displaying and analysing search results in an intuitive manner. These services are enhanced by integrating various LOD technologies into the authority control system.
Design/methodology/approach
LOD technology is used to access, recycle, share, exchange and disseminate information, among other things. This study evaluates the applicability of Linked Data technologies for the development of library information services.
Findings
Apache Hadoop is used for rapidly storing and processing massive Linked Data data sets. Apache Spark is a free and open-source data processing tool. Hive is a SQL-based data warehouse that enables data scientists to write, read and manage petabytes of data.
Originality/value
The distributed large data storage system Apache HBase does not use SQL. This study’s goal is to search the geographic, authority and bibliographic databases for relevant links found on various websites. When data items are linked together, all of the data bits are linked together as well. The study observed and evaluated the tools and processes and recorded each data item’s URL. As a result, data can be combined across silos, enhanced by third-party data sources and contextualized.
Terry D. May, Shaun H. Dunning, George A. Dowding and Jason O. Hallstrom
Abstract
Wireless sensor networks (WSNs) will profoundly influence the ubiquitous computing landscape. Their utility derives not from the computational capabilities of any single sensor node, but from the emergent capabilities of many communicating sensor nodes. Consequently, the details of communication within and across single-hop neighborhoods are a fundamental component of most WSN applications. But these details are often complex, and popular embedded languages for WSNs provide only low-level communication primitives. We propose that the absence of suitable communication abstractions contributes to the difficulty of developing large-scale WSN applications. To address this issue, we present the design and implementation of a Remote Procedure Call (RPC) abstraction for nesC and TinyOS, the emerging standard for developing WSN applications. We present the key language extensions, operating system services, and automation tools that enable the proposed abstraction. We illustrate these contributions in the context of a representative case study, and analyze the overhead introduced when using our approach. We use these results to draw conclusions regarding the suitability of our work to resource-constrained sensor nodes.
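The essence of an RPC abstraction is that a generated client-side stub marshals a procedure call into a message, and a dispatcher on the remote node unmarshals it and invokes the named handler. The sketch below shows that shape only; it is a hypothetical Python analogue, not the nesC/TinyOS implementation, and JSON stands in for the radio message format.

```python
import json

def make_stub(send):
    """Client-side RPC stub: marshal a call into a message and send it
    (a toy analogue of a generated stub sending a radio message)."""
    def call(proc, *args):
        return send(json.dumps({"proc": proc, "args": list(args)}))
    return call

def dispatch(message, handlers):
    """Remote-node dispatcher: unmarshal the message and invoke the
    named handler with the decoded arguments."""
    msg = json.loads(message)
    return handlers[msg["proc"]](*msg["args"])
```

The overhead the paper measures corresponds to the marshalling, transmission, and dispatch steps that the abstraction hides from the application programmer.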