Search results
1 – 10 of 994
Wei Lu, Andrew MacFarlane and Fabio Venuti
Abstract
Purpose
Being an important data exchange and information storage standard, XML has generated a great deal of interest and particular attention has been paid to the issue of XML indexing. Clear use cases for structured search in XML have been established. However, most of the research in the area is either based on relational database systems or specialized semi‐structured data management systems. This paper aims to propose a method for XML indexing based on the information retrieval (IR) system Okapi.
Design/methodology/approach
First, the paper reviews the structure of inverted files and gives an overview of why this indexing mechanism cannot properly support XML retrieval, using the underlying data structures of Okapi as an example. Then the paper explores a revised method implemented on Okapi using path indexing structures. The paper evaluates these index structures through the metrics of indexing run time, path search run time and space costs using the INEX and Reuters RCV1 collections.
Findings
Initial results on the INEX collections show that there is a substantial overhead in space costs for the method, but this increase does not affect run time adversely. Indexing results on differing sized Reuters RCV1 sub‐collections show that space costs increase significantly as the collection grows, whereas run time increases only linearly. Path search results show sub‐millisecond run times, demonstrating minimal overhead for XML search.
Practical implications
Overall, the results show the method implemented to support XML search in a traditional IR system such as Okapi is viable.
Originality/value
The paper provides useful information on a method for XML indexing based on the IR system Okapi.
Atsushi Keyaki, Jun Miyazaki, Kenji Hatano, Goshiro Yamamoto, Takafumi Taketomi and Hirokazu Kato
Abstract
Purpose
The purpose of this paper is to propose methods for fast incremental indexing with effective and efficient query processing in XML element retrieval. The effectiveness of a search system becomes lower if document updates are not handled when these occur frequently on the Web. The search accuracy is also reduced if drastic changes in document statistics are not managed. However, existing studies of XML element retrieval do not consider document updates, although these studies have attained both effectiveness and efficiency in query processing. Thus, the authors add a function for handling document updates to the existing techniques for XML element retrieval.
Design/methodology/approach
Although it is important to enable fast updates of indices, preliminary experiments have shown that a simple incremental update approach has two problems: some kinds of statistics are inaccurate, and it takes a long time to update indices. Therefore, two methods are proposed: one to approximate term weights accurately with a small number of documents, even for dynamically changing statistics; and the other to eliminate unnecessary update targets.
Findings
Experimental results show that the proposed system can update indices up to 32 per cent faster than simple incremental updates, while search accuracy improves by 4 per cent compared with the simple approach. The proposed methods can also be fast and accurate in query processing, even if document statistics change drastically.
Originality/value
The paper shows that there could be a more practical XML element search engine, which can access the latest XML documents accurately and efficiently.
Hooman Homayounfar and Fangju Wang
Abstract
XML is becoming one of the most important structures for data exchange on the web. Despite having many advantages, XML structure imposes several major obstacles to large document processing. Inconsistency between the linear nature of the current algorithms (e.g. for caching and prefetch) used in operating systems and databases, and the non‐linear structure of XML data makes XML processing more costly. In addition to verbosity (e.g. tag redundancy), interpreting (i.e. parsing) the depth‐first (DF) structure of XML documents is a significant overhead to processing applications (e.g. query engines). Recent research on XML query processing has shown that sibling clustering can improve performance significantly. However, the existing clustering methods are not able to avoid parsing overhead as they are limited by larger document sizes. In this research, we have developed a better data organization for native XML databases, named the sibling‐first (SF) format, that improves query performance significantly. SF uses an embedded index for fast access to child nodes. It also compresses documents by eliminating extra information from the original DF format. The converted SF documents can be processed for XPath query purposes without being parsed. We have implemented the SF storage in virtual memory as well as a format on disk. Experimental results with real data have shown that significantly higher performance can be achieved when XPath queries are conducted on very large SF documents.
Sayyed Mahdi Taheri, Nadjla Hariri and Sayyed Rahmatollah Fattahi
Abstract
Purpose
The aim of this research was to examine the use of the data island method for creating metadata records based on DCXML, MARCXML, and MODS with indexability and visibility of element tag names in web search engines.
Design/methodology/approach
A total of 600 metadata records were developed in two groups (300 HTML-based records in an experimental group with a special structure embedded in the <pre> tag of HTML based on the data island method, and 300 XML-based records as the control group with the normal structure). These records were analyzed through an experimental approach. The records of these two groups were published on two independent websites, and were submitted to the Google and Bing search engines.
Findings
Findings show that all the tag names of the experimental group's metadata records, created with the data island method and indexed by Google and Bing, were visible in the search results, whereas the tag names in the control group's metadata records were not indexed by the search engines. Accordingly, records created with the data island method can be indexed and retrieved by their tag names in the search engines, while the records of the control group are accessible by the element values only. The research suggests some patterns to metadata creators and end users for better indexing and retrieval.
Originality/value
The research used the data island method for creating the metadata records, and deals with the indexability and visibility of the metadata element tag names for the first time.
Abstract
Purpose
The semantic and structural heterogeneity of large Extensible Markup Language (XML) digital libraries emphasizes the need to support approximate queries, i.e. queries where the matching conditions are relaxed so as to retrieve results that may only partially satisfy the user's requests. The paper aims to propose a flexible query answering framework which efficiently supports complex approximate queries on XML data.
Design/methodology/approach
To reduce the number of relaxations applicable to a query, the paper relies on the specification of user preferences about the types of approximations allowed. A specifically devised index structure which efficiently supports both semantic and structural approximations, according to the specified user preferences, is proposed. Also, a ranking model to quantify approximations in the results is presented.
Findings
Personalized queries, on the one hand, effectively narrow the space of query reformulations and, on the other hand, enhance the user's query capabilities with a great deal of flexibility and control over requests. As to the quality of results, the retrieval process benefits considerably from the presence of user preferences in the queries. Experiments demonstrate the effectiveness and the efficiency of the proposal, as well as its scalability.
Research limitations/implications
Future developments concern the evaluation of the effectiveness of personalization on queries through additional examinations of the effects of the variability of parameters expressing user preferences.
Originality/value
The paper is intended for the research community and proposes a novel query model which incorporates user preferences about query relaxations on large heterogeneous XML data collections.
Abubakar Roko, Shyamala Doraisamy, Azrul Hazri Jantan and Azreen Azman
Abstract
Purpose
The purpose of this paper is to propose and evaluate XKQSS, a query structuring method that relegates the task of generating structured queries from a user to a search engine while retaining the simple keyword search query interface. A more effective way of searching XML databases is to use structured queries. However, using query languages to express queries proves difficult for most users, since this requires learning a query language and knowing the underlying data schema. On the other hand, the success of Web search engines has made many users familiar with keyword search and, therefore, they prefer to use a keyword search query interface to search XML data.
Design/methodology/approach
Existing query structuring approaches require users to provide structural hints in their input keyword queries even though their interface is keyword based. Other problems with existing systems include their inability to take keyword query ambiguities into consideration during query structuring and the difficulty of selecting the generated structured query that best represents a given keyword query. To address these problems, this study allows users to submit a schema-independent keyword query, uses named entity recognition (NER) to categorize query keywords in order to resolve query ambiguities, and computes semantic information for a node from its data content. Algorithms are proposed that find user search intentions and convert the intentions into a set of ranked structured queries.
Findings
Experiments with the Sigmod and IMDB datasets were conducted to evaluate the effectiveness of the method. The experimental results show that XKQSS is about 20 per cent more effective in terms of return node identification than XReal, a state-of-the-art system for XML retrieval.
Originality/value
Existing systems do not take keyword query ambiguities into account. XKQSS consists of two guidelines based on NER that help to resolve these ambiguities before converting the submitted query. It also includes a ranking function that computes a score for each generated query by using both semantic information and data statistics, as opposed to the data-statistics-only approach used by existing systems.
Branko Milosavljević, Danijela Boberić and Dušan Surla
Abstract
Purpose
The aim of the research is modeling and implementing a software component for the retrieval of bibliographic records using the Apache Lucene retrieval engine.
Design/methodology/approach
Object‐oriented methodology is used for modeling and implementation of the bibliographic record retrieval engine. Modeling is carried out in a CASE tool that supports the unified modeling language (UML 2.0), while the implementation uses the Java programming language and open source components.
Findings
The result is a software component for the retrieval of bibliographic records that is independent of the bibliographic format used in cataloging. It features great flexibility in terms of configuring search types without the need to change the software implementation.
Research limitations/implications
One of the constraints of this system relates to the problem of searching linking entry fields. UNIMARC format defines fields used to link the item being cataloged to another bibliographic item, so those fields may contain other fields, which can be termed secondary fields. In this proposed solution, secondary fields are treated as all other fields and there is no information whether the search term belongs to the secondary or a regular field.
Practical implications
The proposed solution is integrated into the library information system BISIS, version 4. This version of the BISIS system is in use at university, public and special libraries. By introducing this version, system performance as well as flexibility of the indexing process are improved and, at the same time, librarians are able to perform sophisticated and effective retrieval of bibliographic records.
Originality/value
The contribution of this work is in the design of a customizable record retrieval component. It is configured by means of an XML document for specifying mapping rules between subfields of the bibliographic record format and search types. By using XML it is possible to add new mapping rules without additional programming. In addition, great attention has been paid to the indexing of subfields that contain punctuation marks having special semantic meanings for librarians and the transliteration between Cyrillic and Latin scripts. Also, originality of this work lies in using the Apache Lucene search engine, which facilitates building highly flexible and efficient retrieval systems.
A. Hossein Farajpahlou and Faeze Tabatabai
Abstract
Purpose
The aim of this paper is to examine the indexing quality and ranking of XML content objects containing Dublin Core and MARC 21 metadata elements in dynamic online information environments by general search engines such as Google and Yahoo!
Design/methodology/approach
In total, 100 XML content objects were divided into two groups: those with DCXML elements and those with MARCXML elements. Both groups were published on the web site www.marcdcmi.ir in late July 2009 and were online until June 2010. The web site was introduced to Google and Yahoo! search engines. The indexing quality of metadata elements embedded in the content objects in a dynamic online information environment and their indexing and ranking capabilities were compared and examined.
Findings
The Google search engine was able to retrieve fully all the content objects through their Dublin Core and MARC 21 metadata elements; the Yahoo! search engine, however, did not respond at all. Results of the study showed that all Dublin Core and MARC 21 metadata elements were indexed by the Google search engine. No difference was observed between the indexing quality and ranking of DCXML metadata elements and those of MARCXML. The results of the study revealed that neither the XML‐based Dublin Core Metadata Initiative nor MARC 21 demonstrates any preference regarding access in dynamic online information environments through the Google search engine.
Practical implications
The findings can provide useful information for search engine designers.
Originality/value
The present study was conducted for the first time in dynamic environments using XML‐based metadata elements. It can provide grounds for further studies of this kind.
Rebeca Schroeder, Denio Duarte and Ronaldo dos Santos Mello
Abstract
Purpose
Designing efficient XML schemas is essential for XML applications which manage semi‐structured data. In generating XML schemas, there are two opposing goals: to avoid redundancy and to provide connected structures in order to achieve good performance on queries. In general, highly connected XML structures allow data redundancy, and redundancy‐free schemas generate disconnected XML structures. The purpose of this paper is to describe and evaluate by experiments an approach which balances this trade‐off through a workload analysis. Additionally, it aims to identify the most accessed data based on the workload and suggest indexes to improve access performance.
Design/methodology/approach
The paper applies and evaluates a workload‐aware methodology to provide indexing and highly connected structures for data which are intensively accessed through paths traversed by the workload.
Findings
The paper presents benchmarking results on a set of design approaches for XML schemas and demonstrates that the XML schemas generated by the approach provide high query performance at a low cost in data redundancy, thereby balancing the trade‐off in XML schema design.
Research limitations/implications
Although an XML benchmark is applied in these experiments, further experiments are expected in a real‐world application.
Practical implications
The proposed approach may be applied in a real‐world process for designing new XML databases as well as in reverse engineering processes to improve XML schemas from legacy databases.
Originality/value
Unlike related work, the reported approach integrates the two opposing goals in XML schema design, and generates suitable schemas according to a workload. An experimental evaluation shows that the proposed methodology is promising.
Abstract
The interrelation between Web publishing and information retrieval technologies is explored. The different elements of the Web have implications for indexing and searching Web pages. There are two main platforms used for searching the Web – directories and search engines – which later became combined to create one‐stop search sites, resulting in the Web business model known as portals. Portalisation gave rise to a second‐generation of firms delivering innovative search technology. Various new approaches to Web indexing and information retrieval are listed. PC‐based search tools incorporate intelligent agents to allow greater manipulation of search strategies and results. Current trends are discussed, in particular the rise of XML, and their implications for the future. It is concluded that the Web is emerging from a nascent stage and is evolving into a more complex, diverse and structured environment.