Search results
1 – 10 of over 30,000
Yousuke Watanabe, Hidetaka Kamigaito and Haruo Yokota
Abstract
Purpose
Office documents are widely used in daily activities, so their number keeps increasing, and the demand for sophisticated search over office documents is growing. Recent office document formats are based on a package of multiple XML files. These XML files include not only the body text but also page-structure data and style data. The purpose of this paper is to utilize them to find similar office documents.
Design/methodology/approach
The authors propose SOS, a similarity search method based on the structures and styles of office documents. SOS computes similarity values between multiple pairs of XML files contained in the office documents. The authors also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf-node clustering algorithm.
Findings
SOS and LAX+ are evaluated in experiments on three types of office documents (docx, xlsx and pptx). The results of LAX+ and SOS are better than those of the existing algorithms.
Originality/value
Existing text‐based search engines do not take structure and style of documents into account. SOS can find similar documents by calculating similarities between multiple XML files corresponding to body texts, structures and styles.
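The abstract does not spell out how LAX+ compares two XML files, so the following is only a minimal illustrative sketch of one ingredient such a method could use: representing each XML file by its set of root-to-leaf paths (with leaf text) and scoring a pair by Jaccard overlap. All names here are hypothetical, not the paper's actual algorithm.

```python
import xml.etree.ElementTree as ET

def leaf_paths(xml_text):
    """Collect (root-to-leaf tag path, leaf text) pairs from an XML fragment."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:  # leaf node: record its path and trimmed text
            paths.add((path, (node.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths

def xml_similarity(a, b):
    """Jaccard similarity over leaf paths -- a stand-in for leaf-based XML comparison."""
    pa, pb = leaf_paths(a), leaf_paths(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)

doc1 = "<w><body><p>hello</p><p>world</p></body></w>"
doc2 = "<w><body><p>hello</p><p>there</p></body></w>"
print(round(xml_similarity(doc1, doc2), 2))  # one shared leaf of three distinct
```

A structure-and-style-aware search along SOS's lines would compute such scores separately for the body, structure and style XML parts of a package and combine them.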
Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek and Tomáš Skopal
Abstract
Purpose
Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.
Design/methodology/approach
In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
Findings
The study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.
Originality/value
To the best of authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.
Cong-Phuoc Phan, Hong-Quang Nguyen and Tan-Tai Nguyen
Abstract
Purpose
Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industry. To exploit their potential fully, searching these patent documents has become an increasingly important topic. Although much research has processed large collections, few studies have attempted to integrate both patent classifications and specifications for analyzing user queries. Consequently, queries are often insufficiently analyzed, limiting the accuracy of search results. This paper aims to address this limitation by exploiting semantic relationships between patent contents and their classification.
Design/methodology/approach
The contributions are fourfold. First, the authors enhance similarity measurement between two short sentences, making it 20 per cent more accurate. Second, the Graph-embedded Tree ontology is enriched by integrating both patent documents and the classification scheme. Third, the ontology does not rely on rule-based methods or text matching; instead, a heuristic meaning comparison is applied to extract semantic relationships between concepts. Finally, the patent search approach uses the ontology effectively, with the results sorted by their most common order.
Findings
An experiment on searching 600 patent documents in the field of logistics yields a 15 per cent improvement in F-measure compared with traditional approaches.
Research limitations/implications
The research, however, still requires improvement: the terms and phrases extracted as nouns and noun phrases can make little sense in some respects and thus might not yield high accuracy. The large collection of extracted relationships could be further optimized for conciseness. In addition, parallel processing such as MapReduce could be used to improve search performance.
Practical implications
The experimental results could be used for scientists and technologists to search for novel, non-obvious technologies in the patents.
Social implications
High-quality patent search results will reduce patent infringement.
Originality/value
The proposed ontology is semantically enriched by integrating both patent documents and their classification. This ontology facilitates the analysis of the user queries for enhancing the accuracy of the patent search results.
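The abstract mentions improving similarity measurement between two short sentences but does not reproduce the measure itself. As a point of reference only, a common baseline for short-sentence similarity is cosine similarity over term-frequency vectors; this sketch shows that baseline, not the paper's enhanced method.

```python
from collections import Counter
import math

def cosine_similarity(s1, s2):
    """Cosine similarity of term-frequency vectors -- a common baseline,
    not the paper's enhanced short-sentence measure."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)  # Counter returns 0 for absent terms
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# Two of three terms shared between the sentences:
print(round(cosine_similarity("patent retrieval system", "patent search system"), 2))
```

A semantically enriched measure, as the paper proposes, would additionally credit related-but-different terms (e.g. "retrieval" vs. "search"), which pure term overlap cannot do.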
Young Man Ko, Min Sun Song and Seung Jun Lee
Abstract
Purpose
This study aims to develop metadata of conceptual elements based on the text structure of research articles on Korean studies, to propose a search algorithm that reflects combinations of semantically relevant data in accordance with the search intention behind a research-paper query, and to examine whether the algorithm produces a difference in intention-based search results.
Design/methodology/approach
This study constructed a metadata database of 5,007 research articles on Korean studies arranged by conceptual elements of text structure and developed the F1(w)-score, which weights conceptual elements based on the F1-score and the number of data points from each element. This study evaluated the algorithm by comparing search results of the F1(w)-score algorithm with those of the Term Frequency–Inverse Document Frequency (TF-IDF) algorithm and simple keyword search.
Findings
The authors find that the higher the F1(w)-score, the closer the semantic relevance of search intention. Furthermore, F1(w)-score generated search results were more closely related to the search intention than those of TF-IDF and simple keyword search.
Research limitations/implications
Although the F1(w)-score was developed in this study to evaluate search results over a metadata database structured by the conceptual elements of the text structure of Korean studies, the algorithm can be used as a tool for searching other databases, provided the required tuning of the weights is performed.
Practical implications
A metadata database based on text structure and a search method based on weights of metadata elements – F1(w)-score – can be useful for interdisciplinary studies, especially for semantic search in regional studies.
Originality/value
This paper presents a methodology for supporting IR using the F1(w)-score, a novel model for weighting metadata elements based on text structure. The F1(w)-score-based search results show the combination of semantically relevant data, which are otherwise difficult to search for using similarity of search words.
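The exact F1(w)-score formula is not given in the abstract, so the following is only a hypothetical sketch of the general idea it describes: a query-term match counts more when it occurs in a more important conceptual element of the text structure. The element names and weights below are invented for illustration.

```python
# Hypothetical weights per conceptual element of text structure;
# the paper's actual F1(w)-score weighting is derived from F1-scores
# and per-element data counts, not fixed by hand like this.
ELEMENT_WEIGHTS = {"research_question": 3.0, "method": 2.0, "background": 1.0}

def weighted_score(query_terms, metadata):
    """Score a record by summing element weights for each query-term match.

    `metadata` maps a conceptual element name to its text.
    """
    score = 0.0
    for element, text in metadata.items():
        tokens = text.lower().split()
        w = ELEMENT_WEIGHTS.get(element, 1.0)
        score += w * sum(tokens.count(t.lower()) for t in query_terms)
    return score

record = {
    "research_question": "effects of Confucian thought on Korean education",
    "background": "overview of Korean studies scholarship",
}
print(weighted_score(["korean"], record))  # 3.0 from the question + 1.0 from background
```

Ranking by such a score, rather than by raw term frequency as TF-IDF does, is what lets element-aware search favour records where the query terms sit in the semantically central parts of an article.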
Abstract
Purpose
The purpose of this paper is to introduce a new similarity method to gauge the differences between two subject hierarchical structures.
Design/methodology/approach
In the proposed similarity measure, nodes on two hierarchical structures are projected onto a two-dimensional space, respectively, and both structural similarity and subject similarity of nodes are considered in the similarity between the two hierarchical structures. The extent to which the structural similarity impacts on the similarity can be controlled by adjusting a parameter. An experiment was conducted to evaluate soundness of the measure. Eight experts whose research interests were information retrieval and information organization participated in the study. Results from the new measure were compared with results from the experts.
Findings
The evaluation shows strong correlations between the results from the new method and the results from the experts. It suggests that the similarity method achieved satisfactory results.
Practical implications
Hierarchical structures that are found in subject directories, taxonomies, classification systems, and other classificatory structures play an extremely important role in information organization and information representation. Measuring the similarity between two subject hierarchical structures allows an accurate overarching understanding of the degree to which the two hierarchical structures are similar.
Originality/value
Both structural similarity and subject similarity of nodes were considered in the proposed similarity method, and the extent to which the structural similarity impacts on the similarity can be adjusted. In addition, a new evaluation method for a hierarchical structure similarity was presented.
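The abstract describes blending structural similarity and subject similarity of nodes with an adjustable parameter, but not the formulas themselves. The sketch below shows only the shape of such a tunable mix, with an invented toy structural signal; it is not the paper's actual projection-based measure.

```python
def blended_similarity(struct_sim, subject_sim, alpha=0.5):
    """Combine structural and subject similarity; `alpha` controls the
    weight of the structural component (a sketch of the paper's tunable
    mix, not its actual formula)."""
    return alpha * struct_sim + (1 - alpha) * subject_sim

def depth_similarity(depth_a, depth_b):
    """A toy structural signal: how close two nodes sit in their hierarchies."""
    return 1.0 / (1.0 + abs(depth_a - depth_b))

# Two nodes at depths 2 and 3 whose subject labels overlap strongly (0.9),
# with the structural component weighted at 0.3:
s = blended_similarity(depth_similarity(2, 3), 0.9, alpha=0.3)
print(round(s, 2))
```

Setting `alpha` to 0 reduces the measure to pure subject similarity and 1 to pure structure, which is the kind of control the abstract says the parameter provides.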
Jean‐Claude Usunier and Stéphane Sbizzera
Abstract
Purpose
Local marketing decisions are too often made on a dichotomous basis, either standardize or fully adapt. However, similarities are too substantial and differences go too deep to be ignored. This article aims to articulate similarities and differences in local consumer experience across multiple contexts.
Design/methodology/approach
Language, being used daily in local contexts, reflects local knowledge (Geertz). This paper shows how translation/back‐translation can be used as a discovery tool, along with depth interviews and checks of researcher interpretations by informants, to generate cognitive mapping of consumption and taste experiences. Local words, used as emic signals, are combined into full portraits of the local experiences as narratives linking people to products and taste. Local portraits can then be merged to derive commonalities emergent from within the contexts studied. The comparative thick description framework is applied to the bitterness and crunchiness taste experiences in ten countries (China, Croatia, El Salvador, France, Germany, Japan, Mexico, Thailand, Tunisia, Turkey) and nine languages.
Findings
Local experiences in several different languages and countries in different areas of the world can be surveyed, compared, and organized into cognitive maps (Eden), which highlight commonalities and differences between contexts. In essence, differences are qualitative, dealing with creolization patterns, local consumption experience, local preferences, perceptions, and associations.
Research limitations/implications
This approach can be considered as interpretive and, although driven by a systematic approach, depends on researcher and informant expertise and rigor.
Practical implications
Cognitive maps help evaluate cross‐national differences and similarities in local markets. The emergent similarities and differences are highly meaningful for glocalizing marketing strategies, in terms of advertising, branding, and packaging.
Originality/value
Significant insights derived from this method can be tested in a more traditional and applied manner. This allows quicker insights into new local marketplaces and a progressive enrichment of cognitive maps with new languages and countries.
Abstract
Not only does the problem of correcting spelling errors by computer have a long history, it is evidently of considerable current interest as papers and letters on the topic continue to appear rapidly. This is not surprising, since techniques useful in detecting and correcting mis‐spellings normally have other important applications. Moreover, both the power of small computers and the routine production of machine‐readable text have increased enormously over the last decade to the point where automatic spelling error detection/correction has become not only feasible but highly desirable.
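A classic ingredient of the spelling-correction techniques this article surveys is minimum edit distance: a misspelling is corrected to the dictionary word reachable with the fewest single-character edits. The sketch below shows that standard technique in its simplest form, not any specific system from the article.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, dictionary):
    """Pick the dictionary word with the smallest edit distance."""
    return min(dictionary, key=lambda w: edit_distance(word, w))

print(correct("speling", ["spelling", "sapling", "peeling"]))  # spelling
```

Practical correctors refine this with error models for common typing mistakes (transpositions, keyboard adjacency) and word frequency, but edit distance remains the core of many of them.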
Abstract
Purpose
The purpose of this paper is to propose a solution for automating the task of matching business process models and search for correspondences with regard to the model semantics, thus improving the efficiency of such works.
Design/methodology/approach
A method is proposed that combines several semantic technologies. The research follows a design-science-oriented approach in that a method, together with its supporting artifacts, has been engineered. Its application allows for reusing legacy models and automatically determining semantic similarity.
Findings
The method has been applied and the first findings suggest the effectiveness of the approach. The results of applying the method show its feasibility and significance. The suggested heuristic computing of semantic correspondences between semantically heterogeneous business process models is flexible and can support domain users.
Research limitations/implications
Even though the solution offered is directly usable, the full complexity of natural language as found in model element labels cannot yet be completely resolved. Further research could contribute to optimizing and refining the automatic matching and linguistic procedures. Nonetheless, an open research question has been addressed.
Practical implications
The method presented is aimed at adding to the methods in the field of business process management and could extend the possibilities of automating support for business analysis.
Originality/value
The suggested combination of semantic technologies is innovative and addresses the aspect of semantic heterogeneity in a holistic way, which is novel to the field.
Abstract
Purpose
The purpose of this paper is to elaborate the picture of strategies for information searching and seeking by reviewing the conceptualizations on this topic in the field of library and information science (LIS).
Design/methodology/approach
The study draws on Henry Mintzberg’s idea of strategy as plan and strategy as pattern in a stream of actions. Conceptual analysis of 57 LIS investigations was conducted to find out how researchers have approached the above aspects in the characterizations of information search and seeking strategies.
Findings
In the conceptualizations of information search and information seeking strategies, the aspect of strategy as plan is explicated most clearly in text-book approaches describing the steps of rational web searching. Most conceptualizations focus on the aspect of strategy as pattern in a stream of actions. This approach places the main emphasis on realized strategies, either deliberate or emergent. Deliberate strategies indicate how information search or information seeking processes were oriented by previously existing intentions. Emergent strategies indicate how patterns in information searching and seeking developed in the absence of intentions, or despite them.
Research limitations/implications
The conceptualizations of the shifts in information seeking and searching strategies were excluded from the study. Similarly, conceptualizations of information search or information retrieval tactics were not examined.
Originality/value
The study breaks new ground by providing an in-depth analysis of the ways in which the key aspects of strategy are conceptualized in the classifications and typologies of information seeking and searching strategies. The findings contribute to the elaboration of the conceptual space of information behaviour research.
Abstract
Purpose
This paper provides an introduction to research in the field of image forensics and asks whether advances in the field of algorithm development and digital forensics will facilitate the examination of images in the scientific publication process in the near future.
Design/methodology/approach
This study looks at the status quo of image analysis in the peer review process and evaluates selected articles from the field of Digital Image and Signal Processing that have addressed the discovery of copy-move, cut-paste and erase-fill manipulations.
Findings
The article focuses on forensic research and shows that, despite numerous efforts, there is still no applicable tool for the automated detection of image manipulation. Nonetheless, the status quo for examining images in scientific publications remains visual inspection and will likely remain so for the foreseeable future. This study summarizes aspects that make automated detection of image manipulation difficult from a forensic research perspective.
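To illustrate why copy-move detection is studied, and why it is hard to automate: a copy-move forgery duplicates a region within the same image, so a textbook first step is to flag repeated pixel blocks. The sketch below shows only that naive core on a toy grayscale grid; it is not a tool from the article, and its reliance on exact matches is precisely what real detectors must overcome, since compression noise or slight edits break exact repetition.

```python
def duplicate_blocks(img, size=2):
    """Flag exactly repeated size x size blocks in a grayscale image
    (nested lists of pixel values) -- the naive core of copy-move
    detection. Real images need robust block features, since JPEG
    artifacts and retouching break exact matches."""
    seen, hits = {}, []
    h, w = len(img), len(img[0])
    for y in range(h - size + 1):
        for x in range(w - size + 1):
            block = tuple(tuple(img[y + dy][x + dx] for dx in range(size))
                          for dy in range(size))
            if block in seen:
                hits.append((seen[block], (y, x)))  # (first position, repeat)
            else:
                seen[block] = (y, x)
    return hits

# The left 2x2 block of the top rows reappears two columns to the right:
img = [
    [10, 20, 10, 20],
    [30, 40, 30, 40],
    [ 0,  0,  0,  0],
]
print(duplicate_blocks(img))
```

Research detectors replace the exact block tuple with noise-tolerant descriptors (DCT coefficients, keypoint features) and then match those, which is where the robustness problems the article summarizes arise.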
Research limitations/implications
Results of this study underscore the need for a conceptual reconsideration of the problems involving image manipulation with a view toward the need for interdisciplinary collaboration in conjunction with library and information science (LIS) expertise on information integrity.
Practical implications
This study not only identifies a number of conceptual challenges but also suggests areas of action that the scientific community can address in the future.
Originality/value
Image manipulation is often discussed in isolation as a technical challenge. This study takes a more holistic view of the topic and demonstrates the necessity for a multidisciplinary approach.