Search results
Abstract
This paper will discuss the integration of document image processing and text retrieval principles in order to process and load existing paper documents automatically into an electronic document database that broadens the user's capability to retrieve relevant information more accurately, without going through costly processes to get paper documents into electronic text. The principles of document image processing systems, as well as the problems and shortcomings of most of today's document image processing systems, will be discussed. Then concept retrieval, the latest development in text retrieval, will be discussed, with specific reference to the ability of the TOPIC intelligent text retrieval system to allow users to build up a knowledge base of search objects or concepts that can be used at any point in time by all users of the system. The paper will then look specifically at the automatic processing of paper documents by converting scanned document image pages to electronic text. It will cover the use of optical character recognition (OCR) technology, the indexing and loading of the documents in a text database, the automatic linking of the documents to the related document images, and the retrieval technology available in TOPIC, specifically the TYPO operator, which was developed to handle so‐called dirty data such as common misspellings, character transpositions and ‘dirty’ text received as output from the OCR process. A possible solution to load paper documents quickly and cost‐effectively into an electronic document database will be discussed and demonstrated in detail. The advantages and disadvantages of this approach will be discussed with specific reference to an electronic news clipping service application.
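As a rough illustration of the kind of tolerant matching the abstract attributes to the TYPO operator, the sketch below matches a query term against an index vocabulary using an edit distance that counts insertions, deletions, substitutions and adjacent transpositions. This is not TOPIC's actual algorithm, which the abstract does not describe; the function names `osa_distance` and `fuzzy_match` and the `max_distance` threshold are assumptions made for the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ie" <-> "ei".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(query_term, vocabulary, max_distance=1):
    """Return vocabulary terms within max_distance edits of query_term (hypothetical helper)."""
    q = query_term.lower()
    return sorted(w for w in vocabulary if osa_distance(q, w.lower()) <= max_distance)
```

With `max_distance=1`, a query for "retrieval" would also match the transposed form "retreival". An OCR confusion such as "rn" for "m" costs two edits under this measure, so catching it would need a larger threshold or OCR-specific substitution rules.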
Suliman Al‐Hawamdeh, Rachel de Vere, Geoff Smith and Peter Willett
Abstract
Full‐text documents are usually searched by means of a Boolean retrieval algorithm that requires the user to specify the logical relationships between the terms of a query. In this paper, we summarise the results to date of a continuing programme of research at the University of Sheffield to investigate the use of nearest‐neighbour retrieval algorithms for full‐text searching. Given a natural‐language query statement, our methods result in a ranking of the paragraphs comprising a full‐text document in order of decreasing similarity with the query, where the similarity for each paragraph is determined by the number of keyword stems that it has in common with the query. A full‐text document test collection has been created to allow systematic tests of retrieval effectiveness to be carried out. Experiments with this collection demonstrate that nearest‐neighbour searching provides a means for paragraph‐based access to full‐text documents that is of comparable effectiveness to both Boolean and hypertext searching and that index term weighting schemes which have been developed for the searching of bibliographical databases can also be used to improve the effectiveness of retrieval from full‐text databases. A current project is investigating the extent to which a paragraph‐based full‐text retrieval system can be used to augment the explication facilities of an expert system on welding.
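The paragraph ranking described in this abstract, where similarity is the number of keyword stems a paragraph shares with the query, can be sketched as follows. This is a minimal illustration and not the Sheffield implementation: the stopword list, the crude suffix-stripping `stem` function and the name `rank_paragraphs` are all assumptions, and a real system would use a proper stemmer and, as the abstract notes, term weighting.

```python
import re

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "by", "on", "with"}
# Crude suffix stripping standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Return the set of keyword stems in a piece of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(w) for w in words if w not in STOPWORDS}


def rank_paragraphs(query, paragraphs):
    """Rank paragraphs by the number of stems shared with the query.

    Returns (paragraph_index, shared_stem_count) pairs in order of
    decreasing similarity; paragraphs sharing no stems are omitted.
    """
    query_stems = stems(query)
    scored = [(i, len(query_stems & stems(p))) for i, p in enumerate(paragraphs)]
    scored = [(i, score) for i, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

Unlike a Boolean search, the query here is plain natural language with no operators, and every paragraph that shares at least one stem appears somewhere in the ranking.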
Abstract
The objective of the paper is to amalgamate theories of text retrieval from various research traditions into a cognitive theory for information retrieval interaction. Set in a cognitive framework, the paper outlines the concept of polyrepresentation applied to both the user's cognitive space and the information space of IR systems. The concept seeks to represent the current user's information need, problem state, and domain work task or interest in a structure of causality. Further, it implies that we should apply different methods of representation and a variety of IR techniques of different cognitive and functional origin simultaneously to each semantic full‐text entity in the information space. The cognitive differences imply that by applying cognitive overlaps of information objects, originating from different interpretations of such objects through time and by type, the degree of uncertainty inherent in IR is decreased. Polyrepresentation and the use of cognitive overlaps are associated with, but not identical to, data fusion in IR. By explicitly incorporating all the cognitive structures participating in the interactive communication processes during IR, the cognitive theory provides a comprehensive view of these processes. It encompasses the ad hoc theories of text retrieval and IR techniques hitherto developed in mainstream retrieval research. It has elements in common with van Rijsbergen and Lalmas' logical uncertainty theory and may be regarded as compatible with that conception of IR. Epistemologically speaking, the theory views IR interaction as processes of cognition, potentially occurring in all the information processing components of IR, that may be applied, in particular, to the user in a situational context. The theory draws upon basic empirical results from information seeking investigations in the operational online environment, and from mainstream IR research on partial matching techniques and relevance feedback. 
By viewing users, source systems, intermediary mechanisms and information in a global context, the cognitive perspective attempts a comprehensive understanding of essential IR phenomena and concepts, such as the nature of information needs, cognitive inconsistency and retrieval overlaps, logical uncertainty, the concept of ‘document’, relevance measures and experimental settings. An inescapable consequence of this approach is to rely more on sociological and psychological investigative methods when evaluating systems and to view relevance in IR as situational, relative, partial, differentiated and non‐linear. The lack of consistency among authors, indexers, evaluators or users is of an identical cognitive nature. It is unavoidable, and indeed favourable to IR. In particular, for full‐text retrieval, alternative semantic entities, including Salton et al.'s ‘passage retrieval’, are proposed to replace the traditional document record as the basic retrieval entity. These empirically observed phenomena of inconsistency and of semantic entities and values associated with data interpretation support strongly a cognitive approach to IR and the logical use of polyrepresentation, cognitive overlaps, and both data fusion and data diffusion.
C.R. Watters, M.A. Shepherd, E.W. Grundke and P. Bodorik
Abstract
Although the Boolean combination of keywords and/or subject codes is the predominant access method for the retrieval of passages from full‐text databases, menu access is an attractive alternative. The selection of an access method and the ensuing satisfaction with the results is based on the type of query and on the experience and knowledge of the user. This paper describes a prototype system which has integrated Boolean, menu, and direct access methods for the retrieval of passages from full‐text databases. The integration is based on the hierarchical structure inherent in such databases as legal statutes and regulations and engineering standards. The user may switch freely among access methods in order to develop the most appropriate search strategy. The retrieved passages are presented to the user within the context of the hierarchical structure.
A. Macfarlane, S.E. Robertson and J.A. McCann
Abstract
The progress of parallel computing in Information Retrieval (IR) is reviewed. In particular we stress the importance of the motivation in using parallel computing for text retrieval. We analyse parallel IR systems using a classification defined by Rasmussen and describe some parallel IR systems. We give a description of the retrieval models used in parallel information processing. We describe areas of research which we believe are needed.
Ankie Visschedijk and Forbes Gibb
Abstract
This article reviews some of the more unconventional text retrieval systems, emphasising those which have been commercialised. These sophisticated systems improve on conventional retrieval by using either innovative software or hardware to increase retrieval speed or functionality, precision or recall. The software systems reviewed are: AIDA, CLARIT, Metamorph, SIMPR, STATUS/IQ, TCS, TINA and TOPIC. The hardware systems reviewed are: CAFS‐ISP, the Connection Machine, GESCAN, HSTS, MPP, TEXTRACT, TRW‐FDF and URSA.
Zahra Alvandi Poor, Mahdieh Mirzabeigi and Majid Nabavi
Abstract
Purpose
This study aims to identify the impact of verbal-visual cognitive styles on the level of satisfaction and behavior in the textual and content search of Google Images.
Design/methodology/approach
The “Riding” cognitive style test and a satisfaction questionnaire were used as data collection tools. To collect data on image search behavior, the subjects’ transaction files were recorded using Camtasia software, and the files were then observed and reviewed. The research sample comprised 90 postgraduate students of Shiraz University.
Findings
The results showed that cognitive styles, in interaction with the text-based and content-based search system of “Google Images”, affected users’ satisfaction. Text-based image retrieval, in which information needs were expressed in words, was more compatible with the verbal cognitive style and resulted in greater satisfaction. In contrast, in content-based image retrieval, where information needs could be expressed in the form of images, users with the visual cognitive style were more satisfied. Verbal users performed better in text-based search and visual users in content-based search.
Originality/value
Given the research gap concerning the performance of text-based and content-based image search systems in terms of satisfaction and search behavior by cognitive style, the present study could be considered a small effort to advance knowledge in this area.
Yanti Idaya Aspura M.K. and Shahrul Azman Mohd Noah
Abstract
Purpose
The purpose of this study is to reduce the semantic distance by proposing a model for integrating indexes of textual and visual features via a multi-modality ontology and the use of DBpedia to improve the comprehensiveness of the ontology to enhance semantic retrieval.
Design/methodology/approach
A multi-modality ontology-based approach was developed to integrate high-level concepts and low-level features, as well as integrate the ontology base with DBpedia to enrich the knowledge resource. A complete ontology model was also developed to represent the domain of sport news, with image caption keywords and image features. Precision and recall were used as metrics to evaluate the effectiveness of the multi-modality approach, and the outputs were compared with those obtained using a single-modality approach (i.e. textual ontology and visual ontology).
Findings
The results, based on ten queries, show that the multi-modality ontology-based image retrieval (IMR) system integrated with DBpedia performs better at retrieving correct images in accordance with user queries. The system achieved 100 per cent precision for six of the queries and greater than 80 per cent precision for the other four queries. The text-based system achieved 100 per cent precision for only one query; all other queries yielded precision below 0.500 (50 per cent).
Research limitations/implications
This study focused only on the BBC Sport News collection for the year 2009.
Practical implications
The paper includes implications for the development of ontology-based retrieval for image collections.
Originality/value
This study demonstrates the strength of using a multi-modality ontology integrated with DBpedia for image retrieval to overcome the deficiencies of text-based and ontology-based systems. The results validate semantic text-based retrieval combined with a multi-modality ontology and DBpedia as a useful model for reducing the semantic distance.
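The precision and recall metrics used in the evaluation above follow the standard set-based definitions, which can be sketched as below. The function name `precision_recall` and the image identifiers in the usage note are hypothetical; the abstract does not report the per-query result sets.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    Returns (0.0, 0.0) components when a denominator would be zero.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For example, if a query returns five images of which four are among eight relevant ones, precision is 4/5 (80 per cent) and recall is 4/8 (50 per cent).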
Abstract
Present and possible future developments in the techniques of document management are reviewed, the major ones being text retrieval and scanning and OCR. Acquisition, indexing and thesauri, publishing and dissemination and the document management industry are also addressed. The emerging standards are reviewed and the impact of the Internet is analysed.
Brian Vickery and Alina Vickery
Abstract
There is a huge amount of information and data stored in publicly available online databases that consist of large text files accessed by Boolean search techniques. It is widely held that less use is made of these databases than could or should be the case, and that one reason for this is that potential users find it difficult to identify which databases to search, to use the various command languages of the hosts and to construct the Boolean search statements required. This reasoning has stimulated a considerable amount of exploration and development work on the construction of search interfaces, to aid the inexperienced user to gain effective access to these databases. The aim of our paper is to review aspects of the design of such interfaces: to indicate the requirements that must be met if maximum aid is to be offered to the inexperienced searcher; to spell out the knowledge that must be incorporated in an interface if such aid is to be given; to describe some of the solutions that have been implemented in experimental and operational interfaces; and to discuss some of the problems encountered. The paper closes with an extensive bibliography of references relevant to online search aids, going well beyond the items explicitly mentioned in the text. An index to software appears after the bibliography at the end of the paper.