Search results

1 – 10 of 994
Article
Publication date: 18 September 2009

Wei Lu, Andrew MacFarlane and Fabio Venuti

Being an important data exchange and information storage standard, XML has generated a great deal of interest and particular attention has been paid to the issue of XML indexing

Abstract

Purpose

Being an important data exchange and information storage standard, XML has generated a great deal of interest and particular attention has been paid to the issue of XML indexing. Clear use cases for structured search in XML have been established. However, most of the research in the area is either based on relational database systems or specialized semi‐structured data management systems. This paper aims to propose a method for XML indexing based on the information retrieval (IR) system Okapi.

Design/methodology/approach

First, the paper reviews the structure of inverted files and gives an overview of the issues of why this indexing mechanism cannot properly support XML retrieval, using the underlying data structures of Okapi as an example. Then the paper explores a revised method implemented on Okapi using path indexing structures. The paper evaluates these index structures through the metrics of indexing run time, path search run time and space costs using the INEX and Reuters RVC1 collections.

Findings

Initial results on the INEX collections show that there is a substantial overhead in space costs for the method, but this increase does not affect run time adversely. Indexing results on differing sized Reuters RVC1 sub‐collections show that the increase in space costs with increasing the size of a collection is significant, but in terms of run time the increase is linear. Path search results show sub‐millisecond run times, demonstrating minimal overhead for XML search.

Practical implications

Overall, the results show the method implemented to support XML search in a traditional IR system such as Okapi is viable.

Originality/value

The paper provides useful information on a method for XML indexing based on the IR system Okapi.

Details

Aslib Proceedings, vol. 61 no. 5
Type: Research Article
ISSN: 0001-253X

Keywords

Article
Publication date: 14 June 2013

Atsushi Keyaki, Jun Miyazaki, Kenji Hatano, Goshiro Yamamoto, Takafumi Taketomi and Hirokazu Kato

The purpose of this paper is to propose methods for fast incremental indexing with effective and efficient query processing in XML element retrieval. The effectiveness of a search…

Abstract

Purpose

The purpose of this paper is to propose methods for fast incremental indexing with effective and efficient query processing in XML element retrieval. The effectiveness of a search system becomes lower if document updates are not handled when these occur frequently on the Web. The search accuracy is also reduced if drastic changes in document statistics are not managed. However, existing studies of XML element retrieval do not consider document updates, although these studies have attained both effectiveness and efficiency in query processing. Thus, the authors add a function for handling document updates to the existing techniques for XML element retrieval.

Design/methodology/approach

Though it will be important to enable fast updates of indices, preliminary experiments have shown that a simple incremental update approach has two problems: some kinds of statistics are inaccurate, and it takes a long time to update indices. Therefore, two methods are proposed: one to approximate term weights accurately with a small number of documents, even for dynamically changing statistics; and the other to eliminate unnecessary update targets.

Findings

Experimental results show that this proposed system can update indices up to 32 per cent faster than the simple incremental updates while the search accuracy improved by 4 per cent compared with the simple approach. The proposed methods can also be fast and accurate in query processing, even if document statistics change drastically.

Originality/value

The paper shows that there could be a more practical XML element search engine, which can access the latest XML documents accurately and efficiently.

Details

International Journal of Web Information Systems, vol. 9 no. 2
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 31 December 2006

Hooman Homayounfar and Fangju Wang

XML is becoming one of the most important structures for data exchange on the web. Despite having many advantages, XML structure imposes several major obstacles to large document…

Abstract

XML is becoming one of the most important structures for data exchange on the web. Despite having many advantages, XML structure imposes several major obstacles to large document processing. Inconsistency between the linear nature of the current algorithms (e.g. for caching and prefetch) used in operating systems and databases, and the non‐linear structure of XML data makes XML processing more costly. In addition to verbosity (e.g. tag redundancy), interpreting (i.e. parsing) depthfirst (DF) structure of XML documents is a significant overhead to processing applications (e.g. query engines). Recent research on XML query processing has learned that sibling clustering can improve performance significantly. However, the existing clustering methods are not able to avoid parsing overhead as they are limited by larger document sizes. In this research, We have developed a better data organization for native XML databases, named sibling‐first (SF) format that improves query performance significantly. SF uses an embedded index for fast accessing to child nodes. It also compresses documents by eliminating extra information from the original DF format. The converted SF documents can be processed for XPath query purposes without being parsed. We have implemented the SF storage in virtual memory as well as a format on disk. Experimental results with real data have showed that significantly higher performance can be achieved when XPath queries are conducted on very large SF documents.

Details

International Journal of Web Information Systems, vol. 2 no. 3/4
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 11 March 2014

Sayyed Mahdi Taheri, Nadjla Hariri and Sayyed Rahmatollah Fattahi

The aim of this research was to examine the use of the data island method for creating metadata records based on DCXML, MARCXML, and MODS with indexability and visibility of

Abstract

Purpose

The aim of this research was to examine the use of the data island method for creating metadata records based on DCXML, MARCXML, and MODS with indexability and visibility of element tag names in web search engines.

Design/methodology/approach

A total of 600 metadata records were developed in two groups (300 HTML-based records in an experimental group with special structure embedded in the < pre> tag of HTML based on the data island method, and 300 XML-based records as the control group with the normal structure). These records were analyzed through an experimental approach. The records of these two groups were published on two independent websites, and were submitted to Google and Bing search engines.

Findings

Findings show that all the tag names of the metadata records created based on the data island method relating to the experimental group indexed by Google and Bing were visible in the search results. But the tag names in the control group's metadata records were not indexed by the search engines. Accordingly it is possible to index and retrieve the metadata records by their tag name in the search engines. But the records of the control group are accessible by the element values only. The research suggests some patterns to the metadata creators and the end users for better indexing and retrieval.

Originality/value

The research used the data island method for creating the metadata records, and deals with the indexability and visibility of the metadata element tag names for the first time.

Details

Library Hi Tech, vol. 32 no. 1
Type: Research Article
ISSN: 0737-8831

Keywords

Article
Publication date: 29 August 2008

Wilma Penzo

The semantic and structural heterogeneity of large Extensible Markup Language (XML) digital libraries emphasizes the need of supporting approximate queries, i.e. queries where the…

Abstract

Purpose

The semantic and structural heterogeneity of large Extensible Markup Language (XML) digital libraries emphasizes the need of supporting approximate queries, i.e. queries where the matching conditions are relaxed so as to retrieve results that possibly partially satisfy the user's requests. The paper aims to propose a flexible query answering framework which efficiently supports complex approximate queries on XML data.

Design/methodology/approach

To reduce the number of relaxations applicable to a query, the paper relies on the specification of user preferences about the types of approximations allowed. A specifically devised index structure which efficiently supports both semantic and structural approximations, according to the specified user preferences, is proposed. Also, a ranking model to quantify approximations in the results is presented.

Findings

Personalized queries, on one hand, effectively narrow the space of query reformulations, on the other hand, enhance the user query capabilities with a great deal of flexibility and control over requests. As to the quality of results, the retrieval process considerably benefits because of the presence of user preferences in the queries. Experiments demonstrate the effectiveness and the efficiency of the proposal, as well as its scalability.

Research limitations/implications

Future developments concern the evaluation of the effectiveness of personalization on queries through additional examinations of the effects of the variability of parameters expressing user preferences.

Originality/value

The paper is intended for the research community and proposes a novel query model which incorporates user preferences about query relaxations on large heterogeneous XML data collections.

Details

International Journal of Web Information Systems, vol. 4 no. 3
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 20 April 2015

Abubakar Roko, Shyamala Doraisamy, Azrul Hazri Jantan and Azreen Azman

The purpose of this paper is to propose and evaluate XKQSS, a query structuring method that relegates the task of generating structured queries from a user to a search engine…

Abstract

Purpose

The purpose of this paper is to propose and evaluate XKQSS, a query structuring method that relegates the task of generating structured queries from a user to a search engine while retaining the simple keyword search query interface. A more effective way for searching XML database is to use structured queries. However, using query languages to express queries prove to be difficult for most users since this requires learning a query language and knowledge of the underlying data schema. On the other hand, the success of Web search engines has made many users to be familiar with keyword search and, therefore, they prefer to use a keyword search query interface to search XML data.

Design/methodology/approach

Existing query structuring approaches require users to provide structural hints in their input keyword queries even though their interface is keyword base. Other problems with existing systems include their inability to put keyword query ambiguities into consideration during query structuring and how to select the best generated structure query that best represents a given keyword query. To address these problems, this study allows users to submit a schema independent keyword query, use named entity recognition (NER) to categorize query keywords to resolve query ambiguities and compute semantic information for a node from its data content. Algorithms were proposed that find user search intentions and convert the intentions into a set of ranked structured queries.

Findings

Experiments with Sigmod and IMDB datasets were conducted to evaluate the effectiveness of the method. The experimental result shows that the XKQSS is about 20 per cent more effective than XReal in terms of return nodes identification, a state-of-art systems for XML retrieval.

Originality/value

Existing systems do not take keyword query ambiguities into account. XKSS consists of two guidelines based on NER that help to resolve these ambiguities before converting the submitted query. It also include a ranking function computes a score for each generated query by using both semantic information and data statistic, as opposed to data statistic only approach used by the existing approaches.

Details

International Journal of Web Information Systems, vol. 11 no. 1
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 10 August 2010

Branko Milosavljević, Danijela Boberić and Dušan Surla

The aim of the research is modeling and implementing a software component for the retrieval of bibliographic records using the Apache Lucene retrieval engine.

1152

Abstract

Purpose

The aim of the research is modeling and implementing a software component for the retrieval of bibliographic records using the Apache Lucene retrieval engine.

Design/methodology/approach

Object‐oriented methodology is used for modeling and implementation of the bibliographic record retrieval engine. Modeling is carried out in the CASE tool that supports the unified modeling language (UML 2.0), while the implementation is using the Java programming language and open source components.

Findings

The result is a software component for the retrieval of bibliographic records that are independent of the bibliographic format used in cataloging. It features great flexibility in terms of configuring search types without the need to change the software implementation.

Research limitations/implications

One of the constraints of this system relates to the problem of searching linking entry fields. UNIMARC format defines fields used to link the item being cataloged to another bibliographic item, so those fields may contain other fields, which can be termed secondary fields. In this proposed solution, secondary fields are treated as all other fields and there is no information whether the search term belongs to the secondary or a regular field.

Practical implications

The proposed solution is integrated into library information system BISIS, version 4. This version of the BISIS system is in use at university, public and special libraries. By introducing this version, system performance as well as flexibility of the indexing process are improved and at the same time librarians are able to perform sophisticated and effective retrieval of bibliographic records.

Originality/value

The contribution of this work is in the design of a customizable record retrieval component. It is configured by means of an XML document for specifying mapping rules between subfields of the bibliographic record format and search types. By using XML it is possible to add new mapping rules without additional programming. In addition, great attention has been paid to the indexing of subfields that contain punctuation marks having special semantic meanings for librarians and the transliteration between Cyrillic and Latin scripts. Also, originality of this work lies in using the Apache Lucene search engine, which facilitates building highly flexible and efficient retrieval systems.

Details

The Electronic Library, vol. 28 no. 4
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 22 November 2011

A. Hossein Farajpahlou and Faeze Tabatabai

The aim of this paper is to examine the indexing quality and ranking of XML content objects containing Dublin Core and MARC 21 metadata elements in dynamic online information…

1826

Abstract

Purpose

The aim of this paper is to examine the indexing quality and ranking of XML content objects containing Dublin Core and MARC 21 metadata elements in dynamic online information environments by general search engines such as Google and Yahoo!

Design/methodology/approach

In total, 100 XML content objects were divided into two groups: those with DCXML elements and those with MARCXML elements. Both groups were published on the web site www.marcdcmi.ir in late July 2009 and were online until June 2010. The web site was introduced to Google and Yahoo! search engines. The indexing quality of metadata elements embedded in the content objects in a dynamic online information environment and their indexing and ranking capabilities were compared and examined.

Findings

Google search engine was able to retrieve fully all the content objects through their Dublin Core and MARC 21 metadata elements; Yahoo! search engine, however, did not respond at all. Results of the study showed that all Dublin Core and MARC 21 metadata elements were indexed by Google search engine. No difference was observed between indexing quality and ranking of DCXML metadata elements with that of MARCXML. The results of the study revealed that neither the XML‐based Dublin Core Metadata Initiative nor MARC 21 demonstrate any preference regarding access in dynamic online information environments through Google search engine.

Practical implications

The findings can provide useful information for search engine designers.

Originality/value

The present study was conducted for the first time in dynamic environments using XML‐based metadata elements. It can provide grounds for further studies of this kind.

Details

Aslib Proceedings, vol. 63 no. 6
Type: Research Article
ISSN: 0001-253X

Keywords

Article
Publication date: 16 November 2012

Rebeca Schroeder, Denio Duarte and Ronaldo dos Santos Mello

Designing efficient XML schemas is essential for XML applications which manage semi‐structured data. On generating XML schemas, there are two opposite goals: to avoid redundancy…

Abstract

Purpose

Designing efficient XML schemas is essential for XML applications which manage semi‐structured data. On generating XML schemas, there are two opposite goals: to avoid redundancy and to provide connected structures in order to achieve good performance on queries. In general, highly connected XML structures allow data redundancy, and redundancy‐free schemas generate disconnected XML structures. The purpose of this paper is to describe and evaluate by experiments an approach which balances such trade‐off through a workload analysis. Additionally, it aims to identify the most accessed data based on the workload and suggest indexes to improve access performance.

Design/methodology/approach

The paper applies and evaluates a workload‐aware methodology to provide indexing and highly connected structures for data which are intensively accessed through paths traversed by the workload.

Findings

The paper presents benchmarking results on a set of design approaches for XML schemas and demonstrates that the XML schemas generated by the approach provide high query performance and low cost of data redundancy on balancing the trade‐off on XML schema design.

Research limitations/implications

Although an XML benchmark is applied in these experiments, further experiments are expected in a real‐world application.

Practical implications

The approach proposed may be applied in a real‐world process for designing new XML databases as well as in reverse engineering process to improve XML schemas from legacy databases.

Originality/value

Unlike related work, the reported approach integrates the two opposite goal in the XML schema design, and generates suitable schemas according to a workload. An experimental evaluation shows that the proposed methodology is promising.

Article
Publication date: 1 April 2000

David Green

The interrelation between Web publishing and information retrieval technologies is explored. The different elements of the Web have implications for indexing and searching Web…

2648

Abstract

The interrelation between Web publishing and information retrieval technologies is explored. The different elements of the Web have implications for indexing and searching Web pages. There are two main platforms used for searching the Web – directories and search engines – which later became combined to create one‐stop search sites, resulting in the Web business model known as portals. Portalisation gave rise to a second‐generation of firms delivering innovative search technology. Various new approaches to Web indexing and information retrieval are listed. PC‐based search tools incorporate intelligent agents to allow greater manipulation of search strategies and results. Current trends are discussed, in particular the rise of XML, and their implications for the future. It is concluded that the Web is emerging from a nascent stage and is evolving into a more complex, diverse and structured environment.

Details

Online Information Review, vol. 24 no. 2
Type: Research Article
ISSN: 1468-4527

Keywords

1 – 10 of 994