Search results

1 – 10 of over 4000
Article
Publication date: 22 June 2010

Imam Machdi, Toshiyuki Amagasa and Hiroyuki Kitagawa

The purpose of this paper is to propose general parallelism techniques for holistic twig join algorithms to process queries against Extensible Markup Language (XML) databases on a…

Abstract

Purpose

The purpose of this paper is to propose general parallelism techniques for holistic twig join algorithms to process queries against Extensible Markup Language (XML) databases on a multi‐core system.

Design/methodology/approach

The parallelism techniques comprised data and task parallelism. For data parallelism, the paper adopted stream-based partitioning for XML to partition XML data as the basis of parallelism on multiple CPU cores. The XML data partitioning was performed at two levels. The first level created buckets to establish data independence and balance loads among CPU cores; each bucket was assigned to a CPU core. Within each bucket, the second level of XML data partitioning created finer partitions to provide finer parallelism. Each CPU core performed the holistic twig join algorithm on each of its own finer partitions in parallel with the other CPU cores. For task parallelism, the holistic twig join algorithm was decomposed into two main tasks, which were pipelined to create parallelism. The first task adopted the data parallelism technique, and its outputs were transferred to the second task periodically. Since data transfers incurred overheads, the size of each data transfer needed to be estimated carefully to achieve optimal performance.
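To make the two-level partitioning and the pipelined tasks concrete, here is a minimal Python sketch using multiprocessing; the partitioning policy, the grain and transfer sizes, and all helper names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the two-level partitioning with a pipelined second task.
# All helpers (make_buckets, split_bucket, twig_join, merge_path_solutions)
# are hypothetical placeholders, not the authors' actual routines.
from multiprocessing import Pool

def make_buckets(stream_nodes, n_cores):
    """Level 1: round-robin buckets for data independence and load balance."""
    buckets = [[] for _ in range(n_cores)]
    for i, node in enumerate(stream_nodes):
        buckets[i % n_cores].append(node)
    return buckets

def split_bucket(bucket, grain):
    """Level 2: finer partitions within a bucket for finer parallelism."""
    return [bucket[i:i + grain] for i in range(0, len(bucket), grain)]

def twig_join(partition):
    """Task 1 stand-in: emit 'path solutions' for one fine partition."""
    return [n for n in partition if n % 2 == 0]   # dummy match predicate

def process_bucket(bucket):
    """Runs on one core: twig-join every fine partition of its bucket."""
    return [twig_join(p) for p in split_bucket(bucket, grain=50)]

def merge_path_solutions(batch):
    """Task 2 stand-in: merge path solutions into final twig matches."""
    return sorted(s for part in batch for s in part)

if __name__ == "__main__":
    data = list(range(1000))        # stand-in for parsed XML stream nodes
    n_cores, transfer_size = 4, 8   # transfer size tunes pipeline overhead
    results, pending = [], []
    with Pool(n_cores) as pool:     # task 1 in workers, task 2 in this process
        for batch in pool.imap(process_bucket, make_buckets(data, n_cores)):
            pending += batch
            if len(pending) >= transfer_size:   # periodic transfer to task 2
                results.append(merge_path_solutions(pending))
                pending = []
    if pending:
        results.append(merge_path_solutions(pending))
    print(sum(len(r) for r in results))         # 500 dummy matches
```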

Findings

The data and task parallelism techniques contribute to good performance, especially for queries having complex structures and/or higher query selectivity. The performance of data parallelism can be further improved by task parallelism. Significant performance improvement is attained for queries with higher selectivity because more of the output computation of the second task is performed in parallel with the first task.

Research limitations/implications

The proposed parallelism techniques primarily deal with executing a single long-running query for intra-query parallelism, partitioning XML data on-the-fly and allocating partitions to CPU cores statically. They assume that no dynamic XML data updates occur during parallel execution.

Practical implications

The effectiveness of the proposed parallel holistic twig joins relies fundamentally on some system parameter values that can be obtained from a benchmark of the system platform.

Originality/value

The paper proposes novel techniques that increase parallelism by combining data and task parallelism to achieve high performance. To the best of the authors' knowledge, this is the first paper to parallelize holistic twig join algorithms on a multi-core system.

Details

International Journal of Web Information Systems, vol. 6 no. 2
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 28 September 2007

Alasdair J.G. Gray, Werner Nutt and M. Howard Williams

Distributed data streams are an important topic of current research. In such a setting, data values will be missed, e.g. due to network errors. This paper aims to allow this…

Abstract

Purpose

Distributed data streams are an important topic of current research. In such a setting, data values will be missed, e.g. due to network errors. This paper aims to allow this incompleteness to be detected and overcome with either the user not being affected or the effects of the incompleteness being reported to the user.

Design/methodology/approach

A model for representing the incomplete information has been developed that captures the information that is known about the missing data. Techniques for query answering involving certain and possible answer sets have been extended so that queries over incomplete data stream histories can be answered.
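As a rough illustration of the certain/possible answer distinction, the following sketch assumes a missing stream reading is stored as a (lo, hi) bound; this representation and the threshold query are assumptions for illustration, not the paper's model.

```python
# Hypothetical sketch: answering a threshold query over an incomplete
# stream history. A missing reading is stored as a (lo, hi) bound — one
# plausible way to capture what is known about the missing data.
history = {
    1: 12.0,          # timestamp -> exact reading
    2: (10.0, 20.0),  # value missed; only bounds are known
    3: 25.0,
}

def answers_above(history, threshold):
    certain, possible = set(), set()
    for t, v in history.items():
        lo, hi = (v, v) if isinstance(v, float) else v
        if lo > threshold:
            certain.add(t)      # satisfies the query in every completion
        if hi > threshold:
            possible.add(t)     # satisfies it in at least one completion
    return certain, possible

certain, possible = answers_above(history, 15.0)
print(certain)              # {3}: can be answered completely
print(possible - certain)   # {2}: meta-data reported back to the user
```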

Findings

It is possible to detect when a distributed data stream is missing one or more values. When such data values are missing there will be some information that is known about the data and this is stored in an appropriate format. Even when the available data are incomplete, it is possible in some circumstances to answer a query completely. When this is not possible, additional meta‐data can be returned to inform the user of the effects of the incompleteness.

Research limitations/implications

The techniques and models proposed in this paper have only been partially implemented.

Practical implications

The proposed system is general and can be applied wherever there is a need to query the history of distributed data streams. The work in this paper enables the system to answer queries when there are missing values in the data.

Originality/value

This paper presents a general model of how to detect, represent, and answer historical queries over incomplete distributed data streams.

Details

International Journal of Web Information Systems, vol. 3 no. 1/2
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 19 June 2009

Imam Machdi, Toshiyuki Amagasa and Hiroyuki Kitagawa

The purpose of this paper is to propose Extensible Markup Language (XML) data partitioning schemes that can cope with static and dynamic allocation for parallel holistic twig…

Abstract

Purpose

The purpose of this paper is to propose Extensible Markup Language (XML) data partitioning schemes that can cope with static and dynamic allocation for parallel holistic twig joins: grid metadata model for XML (GMX) and streams‐based partitioning method for XML (SPX).

Design/methodology/approach

GMX exploits the relationships between XML documents and query patterns to perform workload-aware partitioning of XML data. Specifically, the paper constructs a two-dimensional model with a document dimension and a query dimension, in which each object in a dimension is composed of XML metadata related to that dimension. GMX provides a set of XML data partitioning methods that include document clustering, query clustering, document-based refinement, query-based refinement and query-path refinement, thereby enabling XML data partitioning based on the static information of XML metadata. In contrast, SPX explores the structural relationships of query elements and a range-containment property of XML streams to generate partitions and allocate them to cluster nodes on-the-fly.
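The following sketch illustrates the workload-aware flavor of GMX under invented costs: a document-by-query cost matrix drives a greedy, statically balanced assignment of documents to cluster nodes. The greedy policy and all numbers are assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of workload-aware partitioning in the spirit of GMX:
# estimated costs per (document, query) pair guide a static assignment of
# documents to cluster nodes so that workloads stay balanced.
cost = {                        # cost[doc][query] = estimated processing cost
    "d1": {"q1": 5, "q2": 1},
    "d2": {"q1": 4, "q2": 2},
    "d3": {"q1": 1, "q2": 6},
    "d4": {"q1": 2, "q2": 5},
}

def assign(cost, n_nodes):
    loads = [0.0] * n_nodes
    placement = {}
    # Greedy: heaviest documents first, each onto the lightest node.
    for doc in sorted(cost, key=lambda d: -sum(cost[d].values())):
        node = loads.index(min(loads))
        placement[doc] = node
        loads[node] += sum(cost[doc].values())
    return placement, loads

placement, loads = assign(cost, n_nodes=2)
print(placement)   # {'d3': 0, 'd4': 1, 'd1': 0, 'd2': 1}
print(loads)       # [13.0, 13.0] — balanced static workloads
```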

Findings

GMX provides several salient features: a set of partition granularities that balance workloads of query processing costs among cluster nodes statically; inter‐query parallelism as well as intra‐query parallelism at multiple extents; and better parallel query performance when all estimated queries are executed simultaneously to meet their probability of query occurrences in the system. SPX also offers the following features: minimal computation time to generate partitions; balancing skewed workloads dynamically on the system; producing higher intra‐query parallelism; and gaining better parallel query performance.

Research limitations/implications

The proposed XML data partitioning schemes do not currently take into account XML data updates, e.g. new XML documents and query pattern changes submitted by users to the system.

Practical implications

The effectiveness of the XML data partitioning schemes relies mainly on the accuracy of the cost model used to estimate query processing costs. The cost model must be adjusted to reflect the characteristics of the system platform used in the implementation.

Originality/value

This paper proposes novel schemes of conducting XML data partitioning to achieve both static and dynamic workload balance.

Details

International Journal of Web Information Systems, vol. 5 no. 2
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 17 August 2015

Savong Bou, Toshiyuki Amagasa and Hiroyuki Kitagawa

The purpose of this paper is to propose a novel scheme to process XPath-based keyword search over Extensible Markup Language (XML) streams, where one can specify query keywords and…

Abstract

Purpose

The purpose of this paper is to propose a novel scheme to process XPath-based keyword search over Extensible Markup Language (XML) streams, where one can specify query keywords and XPath-based filtering conditions at the same time. Experimental results show that the proposed scheme can efficiently and practically process XPath-based keyword search over XML streams.

Design/methodology/approach

To allow XPath-based keyword search over XML streams, it was attempted to integrate YFilter (Diao et al., 2003) with CKStream (Hummel et al., 2011). More precisely, the nondeterministic finite automaton (NFA) of YFilter is extended so that keyword matching at text nodes is supported. Next, the stack data structure is modified by integrating the set of NFA states in YFilter with the bitmaps generated from the set of keyword queries in CKStream.
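A deliberately simplified sketch of this idea follows, for a single path query with keyword bitmaps tracked alongside the document stack; it is not YFilter or CKStream code, and the event model and bitmap scoping are assumptions.

```python
# Illustrative sketch (not YFilter/CKStream): an NFA-like matcher for one
# path query //a//b, extended with keyword matching at text nodes and a
# bitmap recording which query keywords have been seen.
QUERY_PATH = ["a", "b"]             # //a//b
KEYWORDS = {"xml": 0, "stream": 1}  # keyword -> bit position
ALL_BITS = (1 << len(KEYWORDS)) - 1

def match(events):
    """events: ('start', tag) | ('text', words) | ('end', tag)."""
    stack, state, bitmap, hits = [], 0, 0, []
    for kind, payload in events:
        if kind == "start":
            stack.append((state, bitmap))
            if state < len(QUERY_PATH) and payload == QUERY_PATH[state]:
                state += 1                      # advance the automaton
        elif kind == "text":
            for w in payload.split():
                if w in KEYWORDS:
                    bitmap |= 1 << KEYWORDS[w]  # record keyword occurrence
        else:  # 'end'
            if state == len(QUERY_PATH) and bitmap == ALL_BITS:
                hits.append(payload)            # path + all keywords matched
            state, bitmap = stack.pop()
    return hits

events = [("start", "a"), ("start", "b"),
          ("text", "xml stream data"), ("end", "b"), ("end", "a")]
print(match(events))   # ['b']
```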

Findings

Extensive experiments were conducted using both synthetic and real data sets to show the effectiveness of the proposed method. The experimental results showed that the accuracy of the proposed method was better than that of the baseline method (CKStream), while it consumed less memory. Moreover, the proposed scheme showed good scalability with respect to the number of queries.

Originality/value

Owing to the rapid diffusion of XML streams, the demand for querying such information is also growing. In this situation, the ability to query by combining XPath and keyword search is important, because it is an easy-to-use yet powerful means of querying XML streams. However, no existing work has addressed this issue. This work copes with the problem by combining the XPath-based YFilter with the keyword-search-based CKStream for XML streams to enable XPath-based keyword search.

Details

International Journal of Web Information Systems, vol. 11 no. 3
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 1 September 2005

Yuval Elovici, Chanan Glezer and Bracha Shapira

To propose a model of a privacy‐enhanced catalogue search system (PECSS) in an attempt to address privacy threats to consumers, who search for products and services on the world…

Abstract

Purpose

To propose a model of a privacy‐enhanced catalogue search system (PECSS) in an attempt to address privacy threats to consumers, who search for products and services on the world wide web.

Design/methodology/approach

The model extends an agent‐based architecture for electronic catalogue mediation by supplementing it with a privacy enhancement mechanism. This mechanism introduces fake queries into the original stream of user queries, in an attempt to reduce the similarity between the actual interests of users (“internal user profile”) and the interests as observed by potential eavesdroppers on the web (“external user profile”). A prototype was constructed to demonstrate the feasibility and effectiveness of the model.
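A minimal sketch of the fake-query mechanism follows, assuming a fixed decoy glossary and a fixed fake-to-real ratio; both are illustrative choices, not the paper's glossary or architecture.

```python
# Minimal sketch: mix each real query with k decoys so an eavesdropper's
# "external user profile" diverges from the "internal user profile".
# The glossary and the ratio are invented for illustration.
import random

GLOSSARY = ["gardening tools", "jazz vinyl", "hiking boots",
            "espresso machines", "board games", "aquarium pumps"]

def obfuscated_stream(user_query, fakes_per_query=5):
    """Return the real query hidden among k fake queries, shuffled."""
    queries = [user_query] + random.sample(GLOSSARY, fakes_per_query)
    random.shuffle(queries)
    return queries

print(obfuscated_stream("digital camera"))
```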

Findings

The evaluation of the model indicates that, by generating five fake queries for each original user query, the user's profile is hidden most effectively from any potential eavesdropper. Future research is needed to identify the optimal glossary of fake queries for various clients. The model should also be tested against various attacks perpetrated against the mixed stream of original and fake queries (i.e. statistical clustering).

Research limitations/implications

The model's feasibility was evaluated through a prototype. It was not empirically tested against various statistical methods used by intruders to reveal the original queries.

Practical implications

A useful architecture for electronic commerce providers, internet service providers (ISP) and individual clients who are concerned with their privacy and wish to minimize their dependencies on third‐party security providers.

Originality/value

The contribution of the PECSS model stems from the fact that, as the internet gradually transforms into a non-free service, anonymous browsing can no longer be employed to protect consumers' privacy, and therefore other approaches should be explored. Moreover, unlike other approaches, the model does not rely on the honesty of any third-party mediators or proxies that are also exposed to the interests of the client. In addition, the proposed model is scalable, as it is installed on the user's computer.

Details

Internet Research, vol. 15 no. 4
Type: Research Article
ISSN: 1066-2243

Keywords

Open Access
Article
Publication date: 14 August 2017

Xiu Susie Fang, Quan Z. Sheng, Xianzhi Wang, Anne H.H. Ngu and Yihong Zhang

This paper aims to propose a system for generating actionable knowledge from Big Data and use this system to construct a comprehensive knowledge base (KB), called GrandBase.

Abstract

Purpose

This paper aims to propose a system for generating actionable knowledge from Big Data and use this system to construct a comprehensive knowledge base (KB), called GrandBase.

Design/methodology/approach

In particular, this study extracts new predicates from four types of data sources, namely, Web texts, Document Object Model (DOM) trees, existing KBs and query streams, to augment the ontology of the existing KB (i.e. Freebase). In addition, a graph-based approach is proposed to conduct better truth discovery for multi-valued predicates.
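To illustrate truth discovery that tolerates multi-valued predicates, here is a small fixed-point sketch in which source trust and value confidence reinforce each other, and every value above a threshold is kept rather than a single winner; the claims, threshold and update rule are invented and this is not the paper's graph-based algorithm.

```python
# Hypothetical truth-discovery sketch allowing multiple true values.
claims = {   # source -> claimed values for one predicate (children of X)
    "src_a": {"Ann", "Bob"},
    "src_b": {"Ann", "Bob", "Eve"},
    "src_c": {"Ann"},
}

trust = {s: 0.5 for s in claims}          # initial source reliability
for _ in range(10):                       # fixed-point iteration
    values = {v for vs in claims.values() for v in vs}
    # Value confidence: trust-weighted vote share.
    conf = {v: sum(trust[s] for s in claims if v in claims[s]) /
               sum(trust.values()) for v in values}
    # Source trust: mean confidence of the values it claims.
    trust = {s: sum(conf[v] for v in claims[s]) / len(claims[s])
             for s in claims}

accepted = sorted(v for v, c in conf.items() if c >= 0.5)
print(accepted)   # ['Ann', 'Bob'] — more than one true value allowed
```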

Findings

Empirical studies demonstrate the effectiveness of the approaches presented in this study and the potential of GrandBase. Future research directions regarding GrandBase construction and extension are also discussed.

Originality/value

To revolutionize our modern society by using the wisdom of Big Data, numerous KBs have been constructed to feed the massive knowledge-driven applications with Resource Description Framework triples. The important challenges for KB construction include extracting information from large-scale, possibly conflicting and differently structured data sources (i.e. the knowledge extraction problem) and reconciling the conflicts that reside in the sources (i.e. the truth discovery problem). Tremendous research efforts have been devoted to both problems. However, the existing KBs are far from being comprehensive and accurate: first, existing knowledge extraction systems retrieve data from limited types of Web sources; second, existing truth discovery approaches commonly assume that each predicate has only one true value. In this paper, the focus is on the problem of generating actionable knowledge from Big Data. A system is proposed, which consists of two phases, namely, knowledge extraction and truth discovery, to construct a broader KB, called GrandBase.

Details

PSU Research Review, vol. 1 no. 2
Type: Research Article
ISSN: 2399-1747

Keywords

Article
Publication date: 15 June 2012

Hooran MahmoudiNasab and Sherif Sakr

The purpose of this paper is to present a two‐phase approach for designing an efficient tailored but flexible storage solution for resource description framework (RDF) data based…

Abstract

Purpose

The purpose of this paper is to present a two‐phase approach for designing an efficient tailored but flexible storage solution for resource description framework (RDF) data based on its query workload characteristics.

Design/methodology/approach

The approach consists of two phases. The vertical partitioning phase aims to reduce the number of join operations in the query evaluation process, while the adjustment phase maintains efficient query processing by adapting the underlying schema to the dynamic nature of query workloads.
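A compact sketch of the two phases under invented data: vertical partitioning stores each predicate in its own two-column table, and an adjustment step merges predicates that the workload keeps joining. The merge policy and all data are assumptions for illustration.

```python
# Minimal sketch of vertical partitioning plus a workload-driven adjustment.
from collections import defaultdict

triples = [
    ("p1", "name", "Alice"), ("p1", "email", "a@x.org"),
    ("p2", "name", "Bob"),   ("p2", "email", "b@x.org"),
]

# Phase 1: one (subject, object) table per predicate — fewer joins than a
# single giant triples table for queries touching a single predicate.
tables = defaultdict(list)
for s, p, o in triples:
    tables[p].append((s, o))

# Phase 2 (adjustment): if the workload frequently joins name with email,
# co-locate them in a wider property table keyed by subject.
def merge(tables, preds):
    merged = defaultdict(dict)
    for p in preds:
        for s, o in tables[p]:
            merged[s][p] = o
    return dict(merged)

print(tables["name"])                    # [('p1', 'Alice'), ('p2', 'Bob')]
print(merge(tables, ["name", "email"]))  # subject -> {name, email}
```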

Findings

The authors perform comprehensive experiments on two real‐world RDF datasets to demonstrate that the approach is superior to the state‐of‐the‐art techniques in this domain.

Originality/value

The main motivation behind the authors' approach is that several benchmarking studies have recently shown that each RDF dataset requires a tailored table schema to achieve efficient query processing. None of the previous approaches has taken this requirement into account.

Details

International Journal of Web Information Systems, vol. 8 no. 2
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 29 November 2011

Na Dai and Brian D. Davison

This work aims to investigate the sensitivity of ranking performance with respect to the topic distribution of queries selected for ranking evaluation.

Abstract

Purpose

This work aims to investigate the sensitivity of ranking performance with respect to the topic distribution of queries selected for ranking evaluation.

Design/methodology/approach

The authors reweight queries used in two TREC tasks to make them match three real background topic distributions, and show that the performance rankings of retrieval systems are quite different.
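The effect described above can be illustrated by reweighting per-query scores to match a target topic distribution; the scores and distribution below are invented, and the metric is a simple topic-weighted mean average precision assumed for illustration.

```python
# Illustrative sketch: topic-weighted evaluation for one retrieval system.
per_query_ap = {          # query -> (topic, average precision)
    "q1": ("health", 0.70), "q2": ("health", 0.60),
    "q3": ("travel", 0.20), "q4": ("sports", 0.90),
}
background = {"health": 0.2, "travel": 0.5, "sports": 0.3}  # target mix

def weighted_map(per_query_ap, background):
    by_topic = {}
    for topic, ap in per_query_ap.values():
        by_topic.setdefault(topic, []).append(ap)
    # Each topic contributes its mean AP, weighted by its share of real
    # query traffic rather than its share of the test collection.
    return sum(background[t] * sum(aps) / len(aps)
               for t, aps in by_topic.items())

print(round(weighted_map(per_query_ap, background), 3))
# Unweighted MAP is 0.60; under this traffic mix it drops to 0.50, so
# system rankings can change with the topic distribution of the queries.
```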

Findings

It is found that search engines tend to perform similarly on queries about the same topic; and search engine performance is sensitive to the topic distribution of queries used in evaluation.

Originality/value

Using experiments with multiple real‐world query logs, the paper demonstrates weaknesses in the current evaluation model of retrieval systems.

Article
Publication date: 21 June 2023

Parvin Reisinezhad and Mostafa Fakhrahmad

Questionnaire-based studies of knowledge, attitude and practice (KAP) are effective research instruments in the field of health, but they have many shortcomings. The purpose of this research is to…

Abstract

Purpose

Questionnaire-based studies of knowledge, attitude and practice (KAP) are effective research instruments in the field of health, but they have many shortcomings. The purpose of this research is to propose an automatic, questionnaire-free method based on deep learning techniques that addresses these shortcomings. The research then applies the proposed method to public comments on Twitter to identify gaps in people's KAP regarding COVID-19.

Design/methodology/approach

In this paper, two models are proposed to achieve these purposes: one for attitude and the other for people's knowledge and practice. First, the authors collect tweets from Twitter and label them; the collected textual data are then preprocessed. Next, the text representation vector for each tweet is extracted using BERT-BiGRU or XLNet-GRU. Finally, for the knowledge and practice problem, a multi-label classifier with 16 classes representing health guidelines is proposed, and for the attitude problem, a multi-class classifier with three classes (positive, negative and neutral) is proposed.
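A structural sketch of the two classification heads in PyTorch follows, assuming precomputed token embeddings (e.g. from BERT) as input; hidden sizes and the pooling choice are assumptions, with only the label counts (16 guideline labels; 3 attitude classes) taken from the text above.

```python
# Structural sketch of the two classifiers, not the authors' exact models.
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    def __init__(self, emb_dim=768, hidden=128, n_out=16):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_out)

    def forward(self, token_embs):              # (batch, seq, emb_dim)
        _, h = self.gru(token_embs)             # h: (2, batch, hidden)
        feats = torch.cat([h[0], h[1]], dim=-1) # concat both directions
        return self.head(feats)                 # raw logits

# Knowledge/practice: 16 guideline labels -> sigmoid/BCE (multi-label).
kp_model, kp_loss = BiGRUClassifier(n_out=16), nn.BCEWithLogitsLoss()
# Attitude: positive/negative/neutral -> softmax/CE (multi-class).
att_model, att_loss = BiGRUClassifier(n_out=3), nn.CrossEntropyLoss()

x = torch.randn(4, 32, 768)    # 4 tweets x 32 tokens (dummy embeddings)
print(kp_model(x).shape)       # torch.Size([4, 16])
print(att_model(x).shape)      # torch.Size([4, 3])
```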

Findings

Labeling quality has a direct relationship with the performance of the final model, so the authors calculated inter-rater reliability using Krippendorff's alpha coefficient, which confirms the reliability of the assessment in both problems: 87% agreement was reached in the knowledge and practice problem and 95% in the attitude problem. The high agreement indicates the reliability of the dataset and warrants the assessment. The proposed models for both problems were evaluated with several metrics and perform better than the common methods. These analyses of KAP are more efficient than questionnaire methods: the approach resolves many shortcomings of questionnaires, the most important being faster evaluation, a larger studied population and the collection of reliable opinions that yield accurate results.

Research limitations/implications

This research is based on social network datasets, which cannot definitively reveal users' public information. Addressing this limitation would involve considerable complexity with little certainty, so the final analysis is presented independently of users' public information.

Practical implications

Combining recurrent neural networks with attention-based methods improves model performance and reduces the need for large training data. Using these methods also improves the implementation of KAP research and eliminates its shortcomings, and the results can be transferred to other text-processing tasks to improve them. The analysis of people's attitude, practice and knowledge regarding the health guidelines supports effective planning and implementation of health decisions, interventions and required training by health institutions. The results show an effective relationship between attitude, practice and knowledge: people are better at following the health guidelines than at being aware of COVID-19 and, despite many tensions during the epidemic, most people still discuss the issue with a positive attitude.

Originality/value

To the best of the authors' knowledge, no text-processing-based method has so far been proposed for conducting KAP research. Moreover, the method draws on one of the most valuable data sources of the current era (i.e. social networks), in which people express their experiences, facts and free opinions. The final analysis therefore provides more realistic results.

Details

Kybernetes, vol. 52 no. 7
Type: Research Article
ISSN: 0368-492X

Keywords

Article
Publication date: 9 March 2020

Bharat Arun Tidke, Rupa Mehta, Dipti Rana, Divyani Mittal and Pooja Suthar

In online social network analysis, the problem of identification and ranking of influential nodes based on their prominence has attracted immense attention from researchers and…

Abstract

Purpose

In online social network analysis, the problem of identifying and ranking influential nodes based on their prominence has attracted immense attention from researchers and practitioners. Identifying and ranking influential nodes is challenging on Twitter, as the data contain heterogeneous features such as tweets, likes, mentions and retweets. The purpose of this paper is to analyze the correlations between various features, evaluation metrics, approaches and results to validate both the selection of features and the results. In addition, the paper uses well-known techniques to find the topical authority and sentiments of influential nodes, which helps smart-city governance bodies make important decisions while understanding the various perceptions of the relevant influential nodes.

Design/methodology/approach

The tweets fetched using the Twitter API are stored in Neo4j to capture graph-based relationships between various features of Twitter data such as followers, mentions and retweets. A consensus approach over these heterogeneous features is then proposed, which generates an individual list of the top-k influential nodes for each feature, such as likes, mentions and retweets.
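As an illustration of combining per-feature top-k lists into one consensus ranking, here is a Borda-count-style sketch; the paper's exact consensus rule is not given here, so this aggregation, like the data, is an assumption.

```python
# Hypothetical consensus over per-feature top-k lists via Borda points.
from collections import Counter

top_k = {   # feature -> top-k influential nodes, best first (invented)
    "likes":    ["u3", "u1", "u5"],
    "mentions": ["u1", "u3", "u2"],
    "retweets": ["u1", "u5", "u3"],
}

def consensus(top_k, k=3):
    scores = Counter()
    for ranking in top_k.values():        # single pass: O(n) list entries
        for pos, node in enumerate(ranking):
            scores[node] += len(ranking) - pos   # Borda points by rank
    return [node for node, _ in scores.most_common(k)]

print(consensus(top_k))   # ['u1', 'u3', 'u5']
```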

Findings

The heterogeneous features are integrated to accomplish the identification and ranking tasks with low computational complexity, i.e. O(n), which suits large-scale online social networks, while achieving better accuracy than the baselines.

Originality/value

Identified influential nodes can act as sources in public decision-making, and their opinions give insights to urban governance bodies such as municipal corporations, as well as similar organizations responsible for smart urban governance and smart-city development.

Details

Kybernetes, vol. 50 no. 2
Type: Research Article
ISSN: 0368-492X

Keywords
