Search results
Suliman Al‐Hawamdeh, Rachel de Vere, Geoff Smith and Peter Willett
Abstract
Full‐text documents are usually searched by means of a Boolean retrieval algorithm that requires the user to specify the logical relationships between the terms of a query. In this paper, we summarise the results to date of a continuing programme of research at the University of Sheffield to investigate the use of nearest‐neighbour retrieval algorithms for full‐text searching. Given a natural‐language query statement, our methods result in a ranking of the paragraphs comprising a full‐text document in order of decreasing similarity with the query, where the similarity for each paragraph is determined by the number of keyword stems that it has in common with the query. A full‐text document test collection has been created to allow systematic tests of retrieval effectiveness to be carried out. Experiments with this collection demonstrate that nearest‐neighbour searching provides a means for paragraph‐based access to full‐text documents that is of comparable effectiveness to both Boolean and hypertext searching and that index term weighting schemes which have been developed for the searching of bibliographical databases can also be used to improve the effectiveness of retrieval from full‐text databases. A current project is investigating the extent to which a paragraph‐based full‐text retrieval system can be used to augment the explication facilities of an expert system on welding.
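The ranking method described above — scoring each paragraph by the number of terms it shares with the query and sorting in order of decreasing similarity — can be illustrated with a minimal sketch. This is not the Sheffield group's implementation: the crude tokeniser below stands in for the keyword stemming the paper actually uses.

```python
def tokens(text):
    # crude tokeniser: lowercase words with punctuation stripped; a real
    # system would reduce words to keyword stems as the paper describes
    return {w.lower().strip(".,;:") for w in text.split() if len(w) > 2}

def rank_paragraphs(query, paragraphs):
    # score each paragraph by the number of terms it shares with the query,
    # then return paragraph indices in order of decreasing similarity
    q = tokens(query)
    scored = [(len(q & tokens(p)), i) for i, p in enumerate(paragraphs)]
    return [i for score, i in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

Given a natural‐language query, the highest-ranked paragraphs are those sharing the most terms with it, with ties broken by document order.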
Abstract
Complete texts of many journals are now available for online searching. Most of these full text databases have been made available on the same or similar search systems that provide access to bibliographic information. The systems use inverted files that retain limited context information (e.g., paragraphs and location of words within paragraphs). The retrieval techniques used are simply those that were developed earlier for bibliographic databases. Retrieval relies on Boolean logic, word stem searching with truncation, and word proximity specification. Minor adjustments have been made for the display of full text databases, allowing words resulting in retrieval to be displayed in context; but changes have not been made in retrieval techniques. This is due to the reliance on search systems that provide access to many types of databases, all of which are by‐products of improved techniques for creating printed publications.
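The retrieval machinery this abstract describes — an inverted file that retains limited context (paragraph and word position), queried with Boolean logic and stem truncation — can be sketched as follows. The function names are illustrative, not taken from any of the systems discussed.

```python
from collections import defaultdict

def build_index(paragraphs):
    # inverted file retaining limited context: each posting records the
    # paragraph id and the word's position within that paragraph
    index = defaultdict(list)
    for pid, para in enumerate(paragraphs):
        for pos, word in enumerate(para.lower().split()):
            index[word].append((pid, pos))
    return index

def boolean_and(index, *terms):
    # Boolean AND: paragraphs containing every query term
    sets = [{pid for pid, _ in index[t]} for t in terms]
    return set.intersection(*sets) if sets else set()

def truncated(index, stem):
    # word-stem searching with truncation: stem "retriev" matches
    # "retrieval", "retrieving", and so on
    return {pid for term in index if term.startswith(stem)
            for pid, _ in index[term]}
```

The stored positions would additionally support the word-proximity operators the abstract mentions; they are omitted here for brevity.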
Abstract
This article describes the fastest growing category of machine‐readable databases — full‐text databases. A selection of articles from the literature on full‐text databases was explored and this provides a basis for the information presented here on search strategy, performance measurement, and benefits and limitations of full‐text databases. Various use studies and uses of full‐text databases have also been listed.
Abstract
The objective of the paper is to amalgamate theories of text retrieval from various research traditions into a cognitive theory for information retrieval interaction. Set in a cognitive framework, the paper outlines the concept of polyrepresentation applied to both the user's cognitive space and the information space of IR systems. The concept seeks to represent the current user's information need, problem state, and domain work task or interest in a structure of causality. Further, it implies that we should apply different methods of representation and a variety of IR techniques of different cognitive and functional origin simultaneously to each semantic full‐text entity in the information space. The cognitive differences imply that by applying cognitive overlaps of information objects, originating from different interpretations of such objects through time and by type, the degree of uncertainty inherent in IR is decreased. Polyrepresentation and the use of cognitive overlaps are associated with, but not identical to, data fusion in IR. By explicitly incorporating all the cognitive structures participating in the interactive communication processes during IR, the cognitive theory provides a comprehensive view of these processes. It encompasses the ad hoc theories of text retrieval and IR techniques hitherto developed in mainstream retrieval research. It has elements in common with van Rijsbergen and Lalmas' logical uncertainty theory and may be regarded as compatible with that conception of IR. Epistemologically speaking, the theory views IR interaction as processes of cognition, potentially occurring in all the information processing components of IR, that may be applied, in particular, to the user in a situational context. The theory draws upon basic empirical results from information seeking investigations in the operational online environment, and from mainstream IR research on partial matching techniques and relevance feedback. 
By viewing users, source systems, intermediary mechanisms and information in a global context, the cognitive perspective attempts a comprehensive understanding of essential IR phenomena and concepts, such as the nature of information needs, cognitive inconsistency and retrieval overlaps, logical uncertainty, the concept of ‘document’, relevance measures and experimental settings. An inescapable consequence of this approach is to rely more on sociological and psychological investigative methods when evaluating systems and to view relevance in IR as situational, relative, partial, differentiated and non‐linear. The lack of consistency among authors, indexers, evaluators or users is of an identical cognitive nature. It is unavoidable, and indeed favourable to IR. In particular, for full‐text retrieval, alternative semantic entities, including Salton et al.'s ‘passage retrieval’, are proposed to replace the traditional document record as the basic retrieval entity. These empirically observed phenomena of inconsistency and of semantic entities and values associated with data interpretation support strongly a cognitive approach to IR and the logical use of polyrepresentation, cognitive overlaps, and both data fusion and data diffusion.
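One operational consequence of polyrepresentation — that objects retrieved by several cognitively different representations are more likely relevant, which is what makes cognitive overlaps related to data fusion — can be sketched in a few lines. This is an interpretation of the idea, not an algorithm given in the paper; the function name is hypothetical.

```python
from collections import Counter

def overlap_rank(result_sets):
    # each result set comes from a different representation of the same
    # information space (e.g. title terms, citations, passages); objects
    # retrieved by more representations rank higher, reflecting the claim
    # that cognitive overlaps decrease the uncertainty inherent in IR
    counts = Counter()
    for results in result_sets:
        counts.update(set(results))
    return [doc for doc, _ in counts.most_common()]
```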
C.R. Watters, M.A. Shepherd, E.W. Grundke and P. Bodorik
Abstract
Although the Boolean combination of keywords and/or subject codes is the predominant access method for the retrieval of passages from full‐text databases, menu access is an attractive alternative. The selection of an access method and the ensuing satisfaction with the results is based on the type of query and on the experience and knowledge of the user. This paper describes a prototype system which has integrated Boolean, menu, and direct access methods for the retrieval of passages from full‐text databases. The integration is based on the hierarchical structure inherent in such databases as legal statutes and regulations and engineering standards. The user may switch freely among access methods in order to develop the most appropriate search strategy. The retrieved passages are presented to the user within the context of the hierarchical structure.
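The key design point — exploiting the hierarchical structure of statutes and standards so that retrieved passages are presented within their context — can be sketched with a toy hierarchy. The tree contents and function name below are invented for illustration; the prototype described in the paper is not reproduced here.

```python
# hypothetical hierarchy, e.g. Act > Part > Section, with passage text at the leaves
tree = {
    "Act 1": {
        "Part A": {"s.1": "welding safety requirements",
                   "s.2": "inspection intervals"},
        "Part B": {"s.3": "penalties for non-compliance"},
    }
}

def find_with_context(node, term, path=()):
    # direct keyword access: return each matching passage together with the
    # chain of ancestor headings, so it can be shown within the hierarchy
    hits = []
    for key, child in node.items():
        if isinstance(child, dict):
            hits.extend(find_with_context(child, term, path + (key,)))
        elif term in child:
            hits.append(path + (key, child))
    return hits
```

Menu access would correspond to browsing the same tree level by level, which is why the three access methods can share one underlying structure.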
E.G. Sieverts, M. Hofstede and B. Oude Groeniger
Abstract
In this article, the fourth in a series on microcomputer software for information storage and retrieval, test results of six indexing and full‐text retrieval programs are presented and various properties and qualities of these programs are discussed. The common feature of programs in these categories is that they are primarily meant to retrieve words (or combinations of them) in large text files. To do this they either simply index existing text files in one or more formats (indexing programs), or they store and index them in their own database format (full‐text retrieval programs). The programs reviewed in this issue are the indexing programs Ask‐It, Texplore and ZYindex and the full‐text retrieval programs KAware, TextMaster and WordCruncher. All programs run under MS‐DOS. In addition, ZYindex has Windows and Unix versions, and TextMaster is also available for Unix. For each of the six programs almost 100 facts and test results are tabulated. The programs are also discussed individually.
W. Tauchert, J. Hospodarsky, J. Krause, C. Schneider and C. Womser‐Hacker
Abstract
This paper reports the results of the information retrieval project PADOK‐II. This project, which began in November 1987, is being carried out by the Linguistic Information Science Group of the University of Regensburg (LIR) in cooperation with the German Patent Office (GPO) and is sponsored by the German Ministry for Research and Technology. The long‐term aim is to integrate artificial intelligence into information retrieval research without neglecting traditional information retrieval methodology. In PADOK‐II an information retrieval system is considered which indexes documents rather shallowly using free‐text or morphological components. A large‐scale retrieval test has been carried out, based on the German Patent Information System. Answers have been obtained to some 400 queries made by 10 users in simulated real‐life situations. These results have been used to attempt to answer the question: ‘How do the linguistically‐based functions of an indexing system contribute to its performance?’ As a spinoff of this test, the influence of document size and structure was studied with a view to identifying the most reasonable basic content for a German Patent Information System.
Suliman Al‐Hawamdeh, Geoff Smith and Peter Willett
Abstract
This paper considers the use of a hypertext system, GUIDE, for paragraph‐based searching in full‐text documents. Searching can be effected in GUIDE both using a conventional, word‐based approach and using the inter‐textual linkage facilities. The effectiveness of these retrieval techniques is evaluated by means of searches of three full‐text documents for which relevance data are available. The results of the searches are compared with those obtained from the use of a nearest neighbour retrieval system that has been developed for the ranking of paragraphs within full‐text documents. The comparison suggests that the linkage facilities in hypertext do not provide a very cost‐effective mechanism for paragraph‐based retrieval.
Abstract
This paper will discuss the integration of document image processing and text retrieval principles in order to process and load existing paper documents automatically in an electronic document database that broadens the user's capability to retrieve relevant information more accurately, without going through costly processes to get paper documents into electronic text. The principles of document image processing systems, as well as the problems and shortcomings of most of today's document image processing systems, will be discussed. Then concept retrieval as the latest development in text retrieval will be discussed, with specific reference to the ability of the TOPIC intelligent text retrieval system to allow users to build up a knowledge base of search objects or concepts that can be used at any point in time by all users for the system. This paper will further specifically look at the automatic processing of paper documents by converting the scanned document image pages through to electronic text. The use of optical character recognition technology, the indexing and loading of the documents in a text database, the automatic linking of the documents to the related document images and the retrieval technology available in TOPIC, specifically the TYPO operator that was developed to handle so‐called dirty data such as the common misspellings, character transpositions and ‘dirty’ text received as output from the OCR process, will be discussed. A possible solution to load paper documents quickly and cost‐effectively into an electronic document database will be discussed and demonstrated in detail. The advantages and disadvantages of this approach will be discussed with specific reference to an electronic news clipping service application.
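Handling "dirty" OCR output — common misspellings, character transpositions and recognition errors — is the job the abstract assigns to TOPIC's TYPO operator. TOPIC's actual algorithm is not described here, but the general technique of approximate term matching can be sketched with a standard edit-distance comparison; the function names are illustrative.

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance:
    # minimum number of insertions, deletions and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query_term, ocr_words, max_errors=1):
    # retrieve OCR output words within max_errors edits of the query term,
    # so that e.g. a misrecognised character still matches
    return [w for w in ocr_words if edit_distance(query_term, w) <= max_errors]
```

Raising the error threshold trades precision for recall, which is the essential tuning decision when searching uncorrected OCR text.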
Ankie Visschedijk and Forbes Gibb
Abstract
This article reviews some of the more unconventional text retrieval systems, emphasising those which have been commercialised. These sophisticated systems improve on conventional retrieval by using either innovative software or hardware to increase retrieval speed or functionality, precision or recall. The software systems reviewed are: AIDA, CLARIT, Metamorph, SIMPR, STATUS/IQ, TCS, TINA and TOPIC. The hardware systems reviewed are: CAFS‐ISP, the Connection Machine, GESCAN, HSTS, MPP, TEXTRACT, TRW‐FDF and URSA.