Search results

1 – 10 of 161
Article
Publication date: 6 March 2009

Kimmo Kettunen

Abstract

Purpose

The purpose of this article is to discuss advantages and disadvantages of various means to manage morphological variation of keywords in monolingual information retrieval.

Design/methodology/approach

The paper is mainly an overview of methods for managing keyword variation in information retrieval. The authors present a compilation of query results from 11 mostly European languages and a new general classification of the language-dependent techniques for managing morphological variation. Variants of the different techniques are compared in some detail in terms of retrieval effectiveness and other criteria.

Findings

The main results of the paper are an overall comparison of reductive and generative keyword management methods in terms of retrieval effectiveness and other broader criteria.

Originality/value

The paper is of value to anyone who wants to get an overall picture of keyword management techniques used in IR.

Details

Journal of Documentation, vol. 65 no. 2
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 February 1996

Robyn Schinke, Mark Greengrass, Alexander M. Robertson and Peter Willett

Abstract

This paper describes the design of a stemming algorithm for searching databases of Latin text. The algorithm uses a simple longest‐match approach with some recoding but differs from most stemmers in its use of two separate suffix dictionaries (one for nouns and adjectives and one for verbs) for processing query and database words. These dictionaries and the associated stemming rules are arranged in such a way that the stemmer does not need to know the grammatical category of the word that is being stemmed. It is very easy to overstem in Latin: the stemmer developed here tends, rather, towards understemming, leaving sufficient grammatical information attached to the stems resulting from its use to enable users to pursue very specific searches for single grammatical forms of individual words.
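The longest‐match idea with two suffix dictionaries can be illustrated with a minimal sketch. The suffix lists, the minimum stem length and the function names below are assumptions made for illustration; they are not the dictionaries or rules from the paper, and the recoding step mentioned in the abstract is omitted. Each word is stemmed against both lists, so the grammatical category never has to be known in advance.

```python
# Illustrative suffix samples only; NOT the dictionaries used in the paper.
NOUN_SUFFIXES = ["ibus", "orum", "arum", "ae", "am", "as", "is", "os",
                 "um", "us", "a", "e", "i", "o"]
VERB_SUFFIXES = ["abant", "ebant", "amus", "atis", "ant", "ent", "unt",
                 "at", "et", "it", "o"]

MIN_STEM_LENGTH = 2  # assumed guard so that very short words are left alone


def longest_match_stem(word, suffixes):
    """Strip the longest listed suffix that still leaves a long enough stem."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word  # no suffix matched: the word is its own stem


def stem_both(word):
    """Return (noun/adjective stem, verb stem) without knowing the word class."""
    word = word.lower()
    return (longest_match_stem(word, NOUN_SUFFIXES),
            longest_match_stem(word, VERB_SUFFIXES))


print(stem_both("puellarum"))
print(stem_both("amabant"))
```

In a sketch like this, both stems would be indexed, and a query word would be stemmed the same way and looked up in whichever index applies.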

Details

Journal of Documentation, vol. 52 no. 2
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 August 2005

Kimmo Kettunen, Tuomas Kunttu and Kalervo Järvelin

Abstract

Purpose

To show that stem generation compares well with lemmatization as a morphological tool for a highly inflectional language for IR purposes in a best‐match retrieval system.

Design/methodology/approach

Effects of three different morphological methods – lemmatization, stemming and stem production – for Finnish are compared in a probabilistic IR environment (INQUERY). Evaluation is done using a four‐point relevance scale which is partitioned differently in different test settings.

Findings

Results show that stem production, a lighter method than morphological lemmatization, compares well with lemmatization in a best‐match IR environment. Differences in performance between stem production and lemmatization are small and are not statistically significant in most of the tested settings. It is also shown that stemming, a hitherto rather neglected method of morphological processing for Finnish, performs reasonably well although the stemmer used – a Porter stemmer implementation – is far from optimal for a morphologically complex language like Finnish. In another series of tests, the effects of compound splitting and derivational expansion of queries are tested.

Practical implications

The usefulness of morphological lemmatization and stem generation for IR purposes can be estimated using many factors. At the average P‐R level, the two methods perform very similarly in a probabilistic IR system. Thus, the choice of method for highly inflectional languages needs to be assessed along other dimensions too.

Originality/value

Results are achieved using Finnish as an example of a highly inflectional language. The results are of interest for anyone who is interested in processing of morphological variation of a highly inflected language for IR purposes.

Details

Journal of Documentation, vol. 61 no. 4
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 August 2005

Carmen Galvez, Félix de Moya‐Anegón and Víctor H. Solana

Abstract

Purpose

To propose a categorization of the different conflation procedures under the two basic approaches, non‐linguistic and linguistic techniques, and to justify the application of normalization methods within the framework of linguistic techniques.

Design/methodology/approach

Presents a range of term conflation methods that can be used in information retrieval. The uniterm and multiterm variants can be considered equivalent units for the purposes of automatic indexing. Stemming algorithms, segmentation rules, association measures and clustering techniques are well‐evaluated non‐linguistic methods, and experiments with these techniques show a wide variety of results. Alternatively, lemmatisation and the use of syntactic pattern‐matching, through equivalence relations represented in finite‐state transducers (FST), are emerging methods for the recognition and standardization of terms.

Findings

The survey attempts to point out the positive and negative effects of the linguistic approach and its potential as a term conflation method.

Originality/value

Outlines the importance of FSTs for the normalization of term variants.

Details

Journal of Documentation, vol. 61 no. 4
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 3 June 2019

Bilal Hawashin, Shadi Alzubi, Tarek Kanan and Ayman Mansour

Abstract

Purpose

This paper aims to propose a new efficient semantic recommender method for Arabic content.

Design/methodology/approach

Three semantic similarities were proposed to be integrated with the recommender system to improve its ability to recommend based on the semantic aspect. The proposed similarities are CHI-based semantic similarity, singular value decomposition (SVD)-based semantic similarity and Arabic WordNet-based semantic similarity. These similarities were compared with the existing similarities used by recommender systems from the literature.

Findings

Experiments show that the proposed semantic methods using CHI-based similarity and SVD-based similarity are more efficient than the existing methods on Arabic text in terms of accuracy and execution time.

Originality/value

Although many previous works proposed recommender system methods for English text, very few concentrated on Arabic text. The field of Arabic recommender systems is largely understudied in the literature. In addition, there is a vital need to consider the semantic relationships behind user preferences to improve the accuracy of the recommendations. The contributions of this work are the following. First, as many recommender methods were proposed for English text and have never been tested on Arabic text, this work compares the performance of these widely used methods on Arabic text. Second, it proposes a novel semantic recommender method for Arabic text. As this method uses semantic similarity, three novel base semantic similarities were proposed and evaluated. Third, this work draws attention to the need for more studies on this understudied topic.

Article
Publication date: 1 June 2001

Ari Pirkola

Abstract

This paper presents a morphological classification of languages from the IR perspective. Linguistic typology research has shown that the morphological complexity of every language in the world can be described by two variables, index of synthesis and index of fusion. These variables provide a theoretical basis for IR research handling morphological issues. A common theoretical framework is needed in particular because of the increasing significance of cross‐language retrieval research and CLIR systems processing different languages. The paper elaborates the linguistic morphological typology for the purposes of IR research. It studies how the indexes of synthesis and fusion could be used as practical tools in mono‐ and cross‐lingual IR research. The need for semantic and syntactic typologies is discussed. The paper also reviews studies made in different languages on the effects of morphology and stemming in IR.

Details

Journal of Documentation, vol. 57 no. 3
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 22 August 2008

Majed Sanan, Mahmoud Rammal and Khaldoun Zreik

Abstract

Purpose

The classification of Arabic documents has recently become a real problem for juridical centers. In this case, some of the Lebanese official journal documents are already classified, and the center has to classify new documents based on these documents. This paper aims to study and explain the application of a supervised learning method to Arabic texts using N‐gram as an indexing method (n = 3).

Design/methodology/approach

The Lebanese official journal documents are categorized into several classes. Supposing that we know the class(es) of some documents (called learning texts), this can help to determine the candidate words of each class by segmenting the documents.

Findings

Results showed that N‐gram text classification using the cosine coefficient measure outperforms classification using Dice's measure and TF*ICF weight. Cosine is thus the best of the three measures, but it is still insufficient. The N‐gram method is good but not sufficient for the classification of Arabic documents, so new approaches, such as distributional or symbolic ones, need to be explored to increase effectiveness.
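To make the comparison of measures concrete, here is a minimal sketch of character trigram (n = 3) profiles scored with a cosine coefficient and with a Dice coefficient. It is not the authors' implementation: the function names, the set-based form of Dice, the whitespace normalization and the toy class profiles are all assumptions, and the TF*ICF weighting is not covered.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after collapsing whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine coefficient between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dice(a, b):
    """Set-based Dice coefficient: shared n-gram types over total types."""
    shared = len(a.keys() & b.keys())
    total = len(a) + len(b)
    return 2 * shared / total if total else 0.0


def classify(document, class_profiles, measure=cosine):
    """Assign the class whose learning-text profile is most similar."""
    doc = char_ngrams(document)
    return max(class_profiles, key=lambda label: measure(doc, class_profiles[label]))


# Hypothetical learning texts; any Unicode text, including Arabic, works the same way.
profiles = {
    "law": char_ngrams("decree law decree"),
    "sport": char_ngrams("football match"),
}
print(classify("new decree", profiles, measure=cosine))
print(classify("new decree", profiles, measure=dice))
```

Because the profiles are built from raw character sequences, a sketch like this needs no stemmer or dictionary, which is one reason n-gram indexing is attractive for morphologically rich languages.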

Originality/value

The results could be used to improve Arabic document classification, including in software. This work has evaluated a number of similarity measures for the classification of Arabic documents, using an Arabic corpus of Lebanese parliament documents, especially Lebanese official journal documents, as the test bed.

Details

Interactive Technology and Smart Education, vol. 5 no. 3
Type: Research Article
ISSN: 1741-5659

Article
Publication date: 28 August 2007

A. Albu‐Schäffer, S. Haddadin, Ch. Ott, A. Stemmer, T. Wimböck and G. Hirzinger

Abstract

Purpose

The paper seeks to present a new generation of torque‐controlled light‐weight robots (LWR) developed at the Institute of Robotics and Mechatronics of the German Aerospace Center.

Design/methodology/approach

An integrated mechatronic design approach for LWR is presented. Owing to the partially unknown properties of the environment, robustness of planning and control with respect to environmental variations is crucial. Robustness is achieved in this context through sensor redundancy and passivity‐based control. In the DLR robot concept, joint torque sensing plays a central role.

Findings

In order to act in unstructured environments and interact with humans, the robots have design features and control/software functionalities which distinguish them from classical robots, such as: load‐to‐weight ratio of 1:1, torque sensing in the joints, active vibration damping, sensitive collision detection, compliant control on joint and Cartesian level.

Practical implications

The DLR robots are excellent research platforms for experimenting with advanced robotics algorithms. Space and medical robotics are further areas for which these robots were designed and in which they will hopefully be applied in the coming years. Potential industrial application fields are fast automatic assembly as well as manufacturing activities done in cooperation with humans (industrial robot assistant). The described functionalities are also highly relevant for the potentially huge market of service robotics. The LWR technology was transferred to KUKA Roboter GmbH, which will bring the first arms to market in the near future.

Originality/value

This paper introduces a new type of LWR with torque sensing in each joint and describes a consistent approach for using these sensors for manipulation in human environments. To the best of the authors' knowledge, the first systematic experimental evaluation of possible injuries during robot‐human crashes using standardized testing facilities is presented.

Details

Industrial Robot: An International Journal, vol. 34 no. 5
Type: Research Article
ISSN: 0143-991X

Article
Publication date: 13 September 2018

Yaghoub Norouzi and Hoda Homavandi

Abstract

Purpose

The purpose of this paper is to investigate image search and retrieval problems in selected search engines in relation to Persian writing style challenges.

Design/methodology/approach

This study is an applied one, and to answer the questions the authors used an evaluative research method. The aim of the research is to explore the morphological and semantic problems of Persian language in connection with image search and retrieval among the three major and widespread search engines: Google, Yahoo and Bing. In order to collect the data, a checklist designed by the researcher was used and then the data were analyzed by descriptive and inferential statistics.

Findings

The results indicate that the Google, Yahoo and Bing search engines do not pay enough attention to morphological and semantic features of the Persian language in image search and retrieval. This research reveals six groups of Persian language features that cause the major problems in this area: derived words, derived/compound words, Persian and Arabic plural words, use of dotted T, use of spoken language, and polysemy. In addition, the results suggest that Google is the best of the three search engines in terms of compatibility with Persian language features.

Originality/value

This study investigated some new aspects of the above-mentioned subject by combining morphological and semantic aspects of the Persian language with image search and retrieval. It is therefore interdisciplinary research, the results of which would help both to offer some solutions and to carry out similar research in this subject area. This study will also fill a gap in the research conducted so far on the Farsi language, especially in image search and retrieval. Moreover, the findings of this study can help to bridge the gap between users' questions and what search engines (systems) retrieve. In addition, the methodology of this paper provides a framework for further research on image search and retrieval in databases and search engines.

Details

Online Information Review, vol. 42 no. 6
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 1 May 2006

Carmen Galvez and Félix de Moya‐Anegón

Abstract

Purpose

To evaluate the accuracy of conflation methods based on finite‐state transducers (FSTs).

Design/methodology/approach

Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance, but very few on conflation performance. The process of normalization we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus for merging term variants into canonical lemmatized forms. Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm.

Findings

The conclusion is that the main strength of lemmatization is its accuracy, whereas its main limitation is the underanalysis of variant forms.

Originality/value

The report outlines the potential of transducers in their application to normalization processes.

Details

Journal of Documentation, vol. 62 no. 3
Type: Research Article
ISSN: 0022-0418
