This note was prompted by Karen Sparck Jones's reference to a paper by Zunde and Slamecka which has recently been reprinted in Introduction to Information Science, edited by Saracevic. Zunde and Slamecka purport to show that, for optimum performance of IR systems, the frequency distribution of descriptor terms should conform to a geometric progression. This result is at variance with the widely accepted result derived from the Shannon model, which shows that optimum performance of an IR system occurs when the descriptor terms are equi‐probable, i.e. when their frequency distribution is uniform. The uncertainty arising from these two different solutions to the same problem clearly led Karen Sparck Jones to have some reservations about the theoretical justification for her interesting idea of weighting search terms to give them, in effect, the equal weights that the usual Shannon result demands for optimum performance. But Sparck Jones need have no such reservations. The result obtained by Zunde and Slamecka, though plausible because it bears a fortuitous resemblance to the distributions of terms found in real systems, is in fact erroneous.
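The Shannon argument can be checked directly: the entropy of the term distribution, and hence the average information carried by a descriptor assignment, is maximised when the terms are equiprobable, and a geometric progression necessarily falls short of that maximum. A minimal sketch in Python; the number of terms and the geometric ratio are illustrative assumptions, not values from the note:

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

k = 8  # hypothetical number of descriptor terms

# Uniform distribution: every term equiprobable.
uniform = [1 / k] * k

# Geometric progression with ratio 1/2, normalised to sum to 1.
geom = [0.5 ** i for i in range(1, k + 1)]
total = sum(geom)
geom = [x / total for x in geom]

print(entropy(uniform))  # log2(k) = 3.0 bits, the maximum
print(entropy(geom))     # strictly less than the uniform case
```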
This article reviews the state of the art in automatic indexing, that is, automatic techniques for analysing and characterising documents, for manipulating their descriptions in searching, and for generating the index language used for these purposes. It concentrates on the literature from 1968 to 1973. Section I defines the topic and its context. Sections II and III consider work in syntax and semantics respectively in detail. Section IV comments on ‘indirect’ indexing. Section V briefly surveys operating mechanized systems. In Section VI major experiments in automatic indexing are reviewed, and Section VII attempts an overall conclusion on the current state of automatic indexing techniques.
To suggest that a theory of classification for information retrieval (IR), called for by Spärck Jones in a 1970 paper, presupposes a full implementation of a pragmatic understanding. Part of the Journal of Documentation celebration, “60 years of the best in information research”.
Literature‐based conceptual analysis, taking Spärck Jones as its starting‐point. The analysis draws on distinctions between “positivism” and “pragmatism”, and between “classical” and Kuhnian understandings of concepts.
Classification for retrieval, both manual and automatic, benefits from drawing upon a combination of qualitative and quantitative techniques, from a consideration of theories of meaning, and from adding top‐down approaches to IR in which divisions of labour, domains, traditions, genres, document architectures, etc. are included as analytical elements, and in which specific IR algorithms are based on the examination of specific literatures. Introduces an example illustrating the consequences of a full implementation of a pragmatist understanding when handling homonyms.
Outlines how to classify from a pragmatic‐philosophical point of view.
Provides, emphasizing a pragmatic understanding, insights of importance to classification for retrieval, both manual and automatic.
This article has been withdrawn as it was published elsewhere and accidentally duplicated. The original article can be seen here: 10.1108/eb026488. When citing the article, please cite: Karen Sparck Jones (1970), “Some Thoughts on Classification for Retrieval”, Journal of Documentation, Vol. 26 Iss: 2, pp. 89‐101.
Previous experiments demonstrated the value of relevance weighting for search terms, but relied on substantial relevance information for the terms. The present experiments were designed to study the effects of weights based on very limited relevance information, for example supplied by one or two relevant documents. The tests simulated iterative searching, as in an on‐line system, and show that even very little relevance information can be of considerable value.
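The kind of relevance weight at issue can be illustrated with the Robertson/Spärck Jones formula, in which the customary 0.5 additions act as point estimates keeping the weight finite even when only one or two relevant documents are known. The counts below are invented for illustration:

```python
import math

def relevance_weight(r, R, n, N):
    """Relevance weight for a search term, with 0.5 smoothing so the
    estimate remains finite under very limited relevance information.

    r: known relevant documents containing the term
    R: known relevant documents in total
    n: documents in the collection containing the term
    N: documents in the collection
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A single known relevant document (R = 1) that contains a rare
# term (r = 1, n = 10 of N = 1000) already yields a strongly
# positive weight for that term.
print(relevance_weight(r=1, R=1, n=10, N=1000))
```

The point matches the abstract: even one relevant document shifts the weights enough to be useful in an iterative, on‐line search.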
This short note seeks to respond to Hjørland and Pedersen's paper “A substantive theory of classification for information retrieval”, which starts from Spärck Jones's “Some thoughts on classification for retrieval”, originally published in 1970.
The note comments on the context in which the 1970 paper was written, and on Hjørland and Pedersen's views, emphasising the need for well‐grounded classification theory and application.
The note maintains that text‐based, a posteriori, classification, as increasingly found in applications, is likely to be more useful, in general, than a priori classification.
The note elaborates on points made in a well‐received earlier paper.
It would be very helpful in retrieval experiments if good retrieval performance for a test collection were known, so that performance for particular devices could be fully evaluated. This paper presents one performance yardstick, based on optimally weighted request terms, and illustrates its application to different test collections.
This paper reports experiments with a term weighting model incorporating relevance information in which it is assumed that index terms are distributed dependently. Initially this model was tested, with complete relevance information, against a similar model which assumes index terms are distributed independently. The experiments demonstrated conclusively that index terms are not independent for a number of diverse document collections. It was concluded that the use of relevance information together with dependence information could potentially improve retrieval effectiveness. As a result of further experiments the initial strict dependence model was modified and, in particular, a new relevance‐based term weight was developed. This modified dependence model was then used as the basis for relevance feedback, i.e. with partial relevance information only, and significant increases in retrieval effectiveness were achieved. The evaluation method used in the feedback experiments emphasized the effect of the feedback on documents which the potential user would not previously have seen. Finally, the incorporation of relevance feedback in an operational system is considered, and in particular it is argued that, if high recall searches are required, relevance feedback based on the modified dependence model may be superior to the widely used Boolean search.
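The independence assumption that these experiments reject is easy to state: under it, the probability that two index terms co‐occur in a document equals the product of their individual probabilities. A toy check, over an invented collection, sketches the comparison:

```python
from itertools import combinations

# Toy collection: each document is represented by its set of index
# terms (invented data, for illustration only).
docs = [
    {"retrieval", "index", "term"},
    {"retrieval", "index"},
    {"retrieval", "weight"},
    {"index", "term"},
    {"weight", "feedback"},
    {"retrieval", "feedback"},
]
N = len(docs)

def p(t):
    """Fraction of documents containing term t."""
    return sum(t in d for d in docs) / N

def p_joint(t1, t2):
    """Fraction of documents containing both t1 and t2."""
    return sum(t1 in d and t2 in d for d in docs) / N

# Under independence, p_joint(t1, t2) == p(t1) * p(t2);
# deviations indicate dependent terms.
for t1, t2 in combinations(["retrieval", "index", "term"], 2):
    print(f"{t1}/{t2}: joint={p_joint(t1, t2):.3f} "
          f"if-independent={p(t1) * p(t2):.3f}")
```

In this invented collection “index” and “term” co‐occur twice as often as independence predicts; a dependence model exploits exactly such deviations when weighting terms.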