Search results

1 – 10 of over 30000
Article
Publication date: 14 June 2013

Yousuke Watanabe, Hidetaka Kamigaito and Haruo Yokota

Office documents are widely used in our daily activities, so their number keeps increasing, and demand for sophisticated search over office documents is growing…

Abstract

Purpose

Office documents are widely used in our daily activities, so their number keeps increasing, and demand for sophisticated search over office documents is growing. Recent office document file formats are based on a package of multiple XML files. These XML files include not only the body text but also page-structure data and style data. The purpose of this paper is to utilize them to find similar office documents.

Design/methodology/approach

The authors propose SOS, a similarity search method based on the structures and styles of office documents. SOS computes similarity values between multiple pairs of XML files included in the office documents. They also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf-node clustering algorithm.
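The leaf-node idea behind such comparisons can be illustrated with a much simpler sketch than LAX+ itself: compare two XML documents by the sets of (leaf path, leaf text) pairs they contain. This is a basic Jaccard baseline for intuition, not the authors' algorithm.

```python
# Illustrative sketch (not LAX+): similarity of two XML documents
# based on the (path-to-leaf, leaf-text) pairs they contain.
import xml.etree.ElementTree as ET

def leaf_pairs(xml_string):
    """Collect (path-to-leaf, text) pairs from an XML document."""
    root = ET.fromstring(xml_string)
    pairs = set()
    def walk(node, path):
        children = list(node)
        if not children:  # a leaf node carries the actual content
            pairs.add((path, (node.text or "").strip()))
        for child in children:
            walk(child, path + "/" + child.tag)
    walk(root, root.tag)
    return pairs

def xml_similarity(a, b):
    """Jaccard similarity over leaf (path, text) pairs."""
    pa, pb = leaf_pairs(a), leaf_pairs(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)

doc1 = "<doc><p>hello</p><p>world</p></doc>"
doc2 = "<doc><p>hello</p><p>there</p></doc>"
print(xml_similarity(doc1, doc2))  # shares 1 of 3 distinct leaf pairs
```

Because the path to each leaf participates in the pair, two documents with identical text but different element nesting score lower than documents that also share structure, which is the intuition behind structure-aware similarity.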

Findings

SOS and LAX+ are evaluated using three types of office documents (docx, xlsx and pptx) in the experiments. The results of LAX+ and SOS are better than those of the existing algorithms.

Originality/value

Existing text‐based search engines do not take structure and style of documents into account. SOS can find similar documents by calculating similarities between multiple XML files corresponding to body texts, structures and styles.

Open Access
Article
Publication date: 15 February 2022

Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek and Tomáš Skopal

Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking…

Abstract

Purpose

Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.

Design/methodology/approach

In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
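The component-pipeline idea can be sketched in a few lines (component names and interfaces here are illustrative assumptions, not the framework's actual API): a discovery pipeline is assembled from an interchangeable representation step and an interchangeable similarity step.

```python
# Hedged sketch of a modular dataset-discovery pipeline: swap in any
# representation or similarity component to form a custom pipeline.
def tokens(dataset):
    """Representation component: lowercase metadata tokens."""
    return set(dataset["title"].lower().split())

def jaccard(a, b):
    """Similarity component: Jaccard overlap of two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def make_pipeline(represent, similarity):
    """Compose a representation step and a similarity step."""
    def discover(query, datasets):
        q = represent(query)
        # Rank candidate datasets by similarity to the query.
        return sorted(datasets,
                      key=lambda d: similarity(q, represent(d)),
                      reverse=True)
    return discover

pipeline = make_pipeline(tokens, jaccard)
datasets = [{"title": "City air quality"},
            {"title": "Air quality sensors"},
            {"title": "Railway timetable"}]
best = pipeline({"title": "city air quality data"}, datasets)[0]
print(best["title"])  # → City air quality
```

Swapping `tokens` for an embedding-based representation, or `jaccard` for another measure, yields a new pipeline without changing the discovery logic, which is the extensibility the framework description emphasizes.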

Findings

The study proposes several proof-of-concept pipelines, including experimental evaluation, that showcase the usage of the framework.

Originality/value

To the best of the authors' knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.

Details

Data Technologies and Applications, vol. 56 no. 4
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 3 December 2018

Cong-Phuoc Phan, Hong-Quang Nguyen and Tan-Tai Nguyen

Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industry. To maximally exploit their…

Abstract

Purpose

Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industry. To maximally exploit their potential, searching these patent documents has become an increasingly important topic. Although much research has processed large collections, few studies have attempted to integrate both patent classifications and specifications for analyzing user queries. Consequently, queries are often insufficiently analyzed, limiting the accuracy of search results. This paper aims to address this limitation by exploiting semantic relationships between patent contents and their classification.

Design/methodology/approach

The contributions are fourfold. First, the authors enhance similarity measurement between two short sentences, making it 20 per cent more accurate. Second, the Graph-embedded Tree ontology is enriched by integrating both patent documents and the classification scheme. Third, the ontology does not rely on rule-based methods or text matching; instead, a heuristic meaning comparison is applied to extract semantic relationships between concepts. Finally, the patent search approach uses the ontology effectively, with the results sorted based on their most common order.
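As a point of reference for the first contribution, a generic baseline for short-sentence similarity looks like the following (an illustrative bag-of-words cosine measure, not the paper's enhanced method):

```python
# Baseline short-sentence similarity: cosine similarity over
# bag-of-words count vectors of the two sentences.
import math
from collections import Counter

def sentence_similarity(s1, s2):
    """Cosine similarity of word-count vectors of two sentences."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

print(sentence_similarity("patent search system",
                          "system for patent retrieval"))
```

Such purely lexical measures miss synonymy and word order, which is exactly the weakness that semantic enhancements of the kind the paper reports aim to overcome.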

Findings

The experiment on searching 600 patent documents in the field of logistics yields a 15 per cent improvement in F-measure compared with traditional approaches.

Research limitations/implications

The research still requires improvement: the terms and phrases extracted as nouns and noun phrases can make little sense in some contexts and thus might not yield high accuracy. The large collection of extracted relationships could also be optimized for conciseness. In addition, parallel processing such as MapReduce could be used to improve search performance.

Practical implications

The experimental results could be used for scientists and technologists to search for novel, non-obvious technologies in the patents.

Social implications

High-quality patent search results will reduce patent infringement.

Originality/value

The proposed ontology is semantically enriched by integrating both patent documents and their classification. This ontology facilitates the analysis of the user queries for enhancing the accuracy of the patent search results.

Details

International Journal of Web Information Systems, vol. 15 no. 3
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 16 July 2021

Young Man Ko, Min Sun Song and Seung Jun Lee

This study aims to develop metadata of conceptual elements based on the text structure of research articles on Korean studies, to propose a search algorithm that reflects the…

Abstract

Purpose

This study aims to develop metadata of conceptual elements based on the text structure of research articles on Korean studies, to propose a search algorithm that reflects the combination of semantically relevant data in accordance with the search intention, and to examine whether the algorithm makes a difference in intention-based search results.

Design/methodology/approach

This study constructed a metadata database of 5,007 research articles on Korean studies arranged by conceptual elements of text structure and developed the F1(w)-score, which weights conceptual elements based on the F1-score and the number of data points from each element. This study evaluated the algorithm by comparing search results of the F1(w)-score algorithm with those of the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm and simple keyword search.
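The TF-IDF baseline used for comparison can be sketched generically (a standard smoothed TF-IDF scorer, not the study's exact configuration):

```python
# Generic smoothed TF-IDF scoring of documents against query terms.
import math
from collections import Counter

def tfidf_scores(query_terms, documents):
    """Score each document against the query terms with TF-IDF."""
    n = len(documents)
    doc_tokens = [doc.lower().split() for doc in documents]
    # Smoothed inverse document frequency for each query term.
    idf = {}
    for t in query_terms:
        df = sum(t in toks for toks in doc_tokens)
        idf[t] = math.log((n + 1) / (df + 1)) + 1
    scores = []
    for toks in doc_tokens:
        counts = Counter(toks)
        if toks:
            score = sum((counts[t] / len(toks)) * idf[t]
                        for t in query_terms)
        else:
            score = 0.0
        scores.append(score)
    return scores

docs = ["korean studies metadata search",
        "weather report",
        "metadata search metadata"]
print(tfidf_scores(["metadata", "search"], docs))
```

TF-IDF weights every matching term the same way regardless of where in the article it occurs; the study's F1(w)-score instead weights matches by the conceptual element of the text structure they fall in.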

Findings

The authors find that the higher the F1(w)-score, the closer the results are to the semantic intent of the search. Furthermore, the F1(w)-score generated search results more closely related to the search intention than those of TF-IDF and simple keyword search.

Research limitations/implications

Although the F1(w)-score was developed in this study to evaluate search results from a metadata database structured by conceptual elements of the text structure of Korean studies, the algorithm can be used as a tool for searching other databases, provided the weighting is tuned accordingly.

Practical implications

A metadata database based on text structure and a search method based on weights of metadata elements – F1(w)-score – can be useful for interdisciplinary studies, especially for semantic search in regional studies.

Originality/value

This paper presents a methodology for supporting information retrieval using the F1(w)-score, a novel model for weighting metadata elements based on text structure. The F1(w)-score-based search results show combinations of semantically relevant data that are otherwise difficult to find using search-word similarity alone.

Details

The Electronic Library, vol. 39 no. 5
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 6 May 2014

Jin Zhang and Marcia Lei Zeng

The purpose of this paper is to introduce a new similarity method to gauge the differences between two subject hierarchical structures.

Abstract

Purpose

The purpose of this paper is to introduce a new similarity method to gauge the differences between two subject hierarchical structures.

Design/methodology/approach

In the proposed similarity measure, nodes of the two hierarchical structures are each projected onto a two-dimensional space, and both the structural similarity and the subject similarity of nodes contribute to the similarity between the two hierarchical structures. The extent to which structural similarity impacts the overall similarity can be controlled by adjusting a parameter. An experiment was conducted to evaluate the soundness of the measure. Eight experts whose research interests were information retrieval and information organization participated in the study. Results from the new measure were compared with results from the experts.
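The blending of structural and subject similarity under a tunable parameter can be sketched abstractly (the function names and the linear blend are illustrative assumptions, not the paper's exact formulas):

```python
# Hedged sketch: blend structural and subject similarity for matched
# node pairs, with alpha controlling the weight of structure.
def node_similarity(structural_sim, subject_sim, alpha=0.5):
    """Linear blend of structural and subject similarity (0..1 each)."""
    return alpha * structural_sim + (1.0 - alpha) * subject_sim

def hierarchy_similarity(node_pairs, alpha=0.5):
    """Average blended similarity over matched node pairs.

    node_pairs is a list of (structural_sim, subject_sim) tuples,
    one per pair of matched nodes in the two hierarchies.
    """
    if not node_pairs:
        return 0.0
    total = sum(node_similarity(s, t, alpha) for s, t in node_pairs)
    return total / len(node_pairs)

# alpha=1.0 scores structure only; alpha=0.0 scores subject only.
pairs = [(1.0, 0.4), (0.8, 0.9)]
print(hierarchy_similarity(pairs, alpha=0.7))
```

Raising `alpha` makes two taxonomies with similar shapes but different topic labels score higher; lowering it privileges topical agreement over shape, which mirrors the adjustable trade-off the abstract describes.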

Findings

The evaluation shows strong correlations between the results from the new method and the results from the experts. It suggests that the similarity method achieved satisfactory results.

Practical implications

Hierarchical structures that are found in subject directories, taxonomies, classification systems, and other classificatory structures play an extremely important role in information organization and information representation. Measuring the similarity between two subject hierarchical structures allows an accurate overarching understanding of the degree to which the two hierarchical structures are similar.

Originality/value

Both the structural similarity and the subject similarity of nodes were considered in the proposed similarity method, and the extent to which structural similarity impacts the overall similarity can be adjusted. In addition, a new evaluation method for hierarchical structure similarity was presented.

Details

Journal of Documentation, vol. 70 no. 3
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 15 February 2013

Jean‐Claude Usunier and Stéphane Sbizzera

Local marketing decisions are too often made on a dichotomous basis, either standardize or fully adapt. However, similarities are too substantial and differences go too deep to be…

Abstract

Purpose

Local marketing decisions are too often made on a dichotomous basis, either standardize or fully adapt. However, similarities are too substantial and differences go too deep to be ignored. This article aims to articulate similarities and differences in local consumer experience across multiple contexts.

Design/methodology/approach

Language, being used daily in local contexts, reflects local knowledge (Geertz). This paper shows how translation/back‐translation can be used as a discovery tool, along with depth interviews and checks of researcher interpretations by informants, to generate cognitive mapping of consumption and taste experiences. Local words, used as emic signals, are combined into full portraits of the local experiences as narratives linking people to products and taste. Local portraits can then be merged to derive commonalities emergent from within the contexts studied. The comparative thick description framework is applied to the bitterness and crunchiness taste experiences in ten countries (China, Croatia, El Salvador, France, Germany, Japan, Mexico, Thailand, Tunisia, Turkey) and nine languages.

Findings

Local experiences in several different languages and countries in different areas of the world can be surveyed, compared, and organized into cognitive maps (Eden), which highlight commonalities and differences between contexts. In essence, differences are qualitative, dealing with creolization patterns, local consumption experience, local preferences, perceptions, and associations.

Research limitations/implications

This approach can be considered as interpretive and, although driven by a systematic approach, depends on researcher and informant expertise and rigor.

Practical implications

Cognitive maps help evaluate cross‐national differences and similarities in local markets. The emergent similarities and differences are highly meaningful for glocalizing marketing strategies, in terms of advertising, branding, and packaging.

Originality/value

Significant insights derived from this method can be tested in a more traditional and applied manner. This allows quicker insights into new local marketplaces and a progressive enrichment of cognitive maps with new languages and countries.

Article
Publication date: 1 April 1982

J.J. POLLOCK

Not only does the problem of correcting spelling errors by computer have a long history, it is evidently of considerable current interest as papers and letters on the topic…

Abstract

Not only does the problem of correcting spelling errors by computer have a long history, it is evidently of considerable current interest as papers and letters on the topic continue to appear rapidly. This is not surprising, since techniques useful in detecting and correcting mis‐spellings normally have other important applications. Moreover, both the power of small computers and the routine production of machine‐readable text have increased enormously over the last decade to the point where automatic spelling error detection/correction has become not only feasible but highly desirable.
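A classic building block in this line of work (a standard technique, not something taken from this particular paper) is ranking dictionary words by Levenshtein edit distance to a suspected misspelling:

```python
# Spelling correction sketch: suggest the dictionary word with the
# smallest Levenshtein edit distance to the input.
def edit_distance(a, b):
    """Dynamic-programming Levenshtein distance between two strings."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution / match
        prev = cur
    return prev[n]

def suggest(word, dictionary):
    """Return the dictionary word closest to the (mis)spelled input."""
    return min(dictionary, key=lambda w: edit_distance(word, w))

print(suggest("spelng", ["spelling", "spending", "sampling"]))
# → spelling
```

Real detectors combine such distance measures with frequency information and phonetic codes, since many distinct words sit at the same small edit distance from a given misspelling.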

Details

Journal of Documentation, vol. 38 no. 4
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 July 2014

Janina Fengel

The purpose of this paper is to propose a solution for automating the task of matching business process models and searching for correspondences with regard to model semantics…

Abstract

Purpose

The purpose of this paper is to propose a solution for automating the task of matching business process models and searching for correspondences with regard to model semantics, thus improving the efficiency of such work.

Design/methodology/approach

A method is proposed based on combining several semantic technologies. The research follows a design-science-oriented approach in that a method, together with its supporting artifacts, has been engineered. Its application allows for reusing legacy models and automatically determining semantic similarity.

Findings

The method has been applied and the first findings suggest the effectiveness of the approach. The results of applying the method show its feasibility and significance. The suggested heuristic computing of semantic correspondences between semantically heterogeneous business process models is flexible and can support domain users.

Research limitations/implications

Even though the solution offered is directly usable, the full complexity of natural language as found in model element labels cannot yet be completely resolved. Further research could contribute to optimizing and refining the automatic matching and linguistic procedures. Nevertheless, an open research question has been addressed.

Practical implications

The method presented is aimed at adding to the methods in the field of business process management and could extend the possibilities of automating support for business analysis.

Originality/value

The suggested combination of semantic technologies is innovative and addresses the aspect of semantic heterogeneity in a holistic manner, which is novel to the field.

Article
Publication date: 10 October 2016

Reijo Savolainen

The purpose of this paper is to elaborate the picture of strategies for information searching and seeking by reviewing the conceptualizations on this topic in the field of library…

Abstract

Purpose

The purpose of this paper is to elaborate the picture of strategies for information searching and seeking by reviewing the conceptualizations on this topic in the field of library and information science (LIS).

Design/methodology/approach

The study draws on Henry Mintzberg’s idea of strategy as plan and strategy as pattern in a stream of actions. Conceptual analysis of 57 LIS investigations was conducted to find out how researchers have approached the above aspects in the characterizations of information search and seeking strategies.

Findings

In the conceptualizations of information search and information seeking strategies, the aspect of strategy as plan is explicated most clearly in textbook approaches describing the steps of rational web searching. Most conceptualizations focus on the aspect of strategy as pattern in a stream of actions. This approach places the main emphasis on realized strategies, either deliberate or emergent. Deliberate strategies indicate how information search or information seeking processes were oriented by previously existing intentions. Emergent strategies indicate how patterns in information seeking and searching developed in the absence of such intentions, or despite them.

Research limitations/implications

The conceptualizations of the shifts in information seeking and searching strategies were excluded from the study. Similarly, conceptualizations of information search or information retrieval tactics were not examined.

Originality/value

The study pioneers by providing an in-depth analysis of the ways in which the key aspects of strategy are conceptualized in the classifications and typologies of information seeking and searching strategies. The findings contribute to the elaboration of the conceptual space of information behaviour research.

Details

Journal of Documentation, vol. 72 no. 6
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 7 December 2021

Thorsten Stephan Beck

This paper provides an introduction to research in the field of image forensics and asks whether advances in the field of algorithm development and digital forensics will…

Abstract

Purpose

This paper provides an introduction to research in the field of image forensics and asks whether advances in the field of algorithm development and digital forensics will facilitate the examination of images in the scientific publication process in the near future.

Design/methodology/approach

This study looks at the status quo of image analysis in the peer review process and evaluates selected articles from the field of Digital Image and Signal Processing that have addressed the discovery of copy-move, cut-paste and erase-fill manipulations.

Findings

The article focuses on forensic research and shows that, despite numerous efforts, there is still no applicable tool for the automated detection of image manipulation. Nonetheless, the status quo for examining images in scientific publications remains visual inspection and will likely remain so for the foreseeable future. This study summarizes aspects that make automated detection of image manipulation difficult from a forensic research perspective.

Research limitations/implications

Results of this study underscore the need for a conceptual reconsideration of the problems involving image manipulation with a view toward the need for interdisciplinary collaboration in conjunction with library and information science (LIS) expertise on information integrity.

Practical implications

This study not only identifies a number of conceptual challenges but also suggests areas of action that the scientific community can address in the future.

Originality/value

Image manipulation is often discussed in isolation as a technical challenge. This study takes a more holistic view of the topic and demonstrates the necessity for a multidisciplinary approach.

Details

Journal of Documentation, vol. 78 no. 5
Type: Research Article
ISSN: 0022-0418
