Search results

1 – 10 of 27
Open Access
Article
Publication date: 15 February 2022

Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek and Tomáš Skopal

Abstract

Purpose

Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.

Design/methodology/approach

In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
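
The component-pipeline idea can be illustrated with a minimal sketch. The component names below are hypothetical illustrations, not the API of the authors' actual framework (whose prototype is on GitHub):

```python
# Minimal sketch of a composable discovery pipeline; component names are
# invented for illustration, not taken from the authors' implementation.
def pipeline(*steps):
    """Chain components so each step's output feeds the next one."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

# Toy components: normalise dataset metadata, then tokenise it into terms.
def lowercase(text):
    return text.lower()

def tokenize(text):
    return text.split()

represent = pipeline(lowercase, tokenize)
print(represent("Open DATA Catalog"))  # ['open', 'data', 'catalog']
```

Swapping a component (say, a different tokeniser or an embedding step) changes the dataset representation without touching the rest of the pipeline, which is the kind of rapid experimentation the framework targets.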

Findings

The study presents several proof-of-concept pipelines, including an experimental evaluation, which showcase the usage of the framework.

Originality/value

To the best of the authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.

Details

Data Technologies and Applications, vol. 56 no. 4
Type: Research Article
ISSN: 2514-9288

Open Access
Article
Publication date: 11 July 2023

Miroslav Despotovic, David Koch, Eric Stumpe, Wolfgang A. Brunauer and Matthias Zeppelzauer

Abstract

Purpose

In this study the authors aim to outline new ways of information extraction for automated valuation models, which in turn would help to increase transparency in valuation procedures and thus contribute to more reliable statements about the value of real estate.

Design/methodology/approach

The authors hypothesize that empirical error in the interpretation and qualitative assessment of visual content can be minimized by collating the assessments of multiple individuals and through repeated trials. Motivated by this problem, the authors developed an experimental approach for semi-automatic extraction of qualitative real estate metadata based on Comparative Judgments and Deep Learning. The authors evaluate the feasibility of their approach with the help of Hedonic Models.

Findings

The results indicate that the collated assessments of qualitative features of interior images have a notable effect on the price models and thus offer potential for further research within this paradigm.

Originality/value

To the best of the authors’ knowledge, this is the first approach that combines and collates the subjective ratings of visual features and deep learning for real estate use cases.

Details

Journal of European Real Estate Research, vol. 16 no. 2
Type: Research Article
ISSN: 1753-9269

Open Access
Article
Publication date: 28 July 2020

Julián Monsalve-Pulido, Jose Aguilar, Edwin Montoya and Camilo Salazar

Abstract

This article proposes an architecture of an intelligent and autonomous recommendation system to be applied to any virtual learning environment, with the objective of efficiently recommending digital resources. The paper presents the architectural details of the intelligent and autonomous dimensions of the recommendation system. The paper describes a hybrid recommendation model that orchestrates and manages the available information and the specific recommendation needs, in order to determine the recommendation algorithms to be used. The hybrid model allows the integration of the approaches based on collaborative filter, content or knowledge. In the architecture, information is extracted from four sources: the context, the students, the course and the digital resources, identifying variables, such as individual learning styles, socioeconomic information, connection characteristics, location, etc. Tests were carried out for the creation of an academic course, in order to analyse the intelligent and autonomous capabilities of the architecture.

Details

Applied Computing and Informatics, vol. 20 no. 1/2
Type: Research Article
ISSN: 2634-1964

Open Access
Article
Publication date: 6 July 2020

Basma Makhlouf Shabou, Julien Tièche, Julien Knafou and Arnaud Gaudinat

Abstract

Purpose

This paper aims to describe an interdisciplinary and innovative research project conducted in Switzerland, at the Geneva School of Business Administration HES-SO and supported by the State Archives of Neuchâtel (Office des archives de l'État de Neuchâtel, OAEN). The problem to be addressed is one of the most classical ones: how to extract and discriminate relevant data in a huge amount of diversified and complex data record formats and contents. The goal of this study is to provide a framework and a proof of concept for a software that helps in taking defensible decisions on the retention and disposal of records and data proposed to the OAEN. For this purpose, the authors designed two axes: the archival axis, to propose archival metrics for the appraisal of structured and unstructured data, and the data mining axis, to propose algorithmic methods as complementary and/or additional metrics for the appraisal process.

Design/methodology/approach

Based on these two axes, this exploratory study designs and tests the feasibility of archival metrics paired with data mining metrics, to advance, as much as possible, the digital appraisal process in a systematic or even automatic way. Under Axis 1, the authors initiated three steps: first, the design of a conceptual framework for records and data appraisal with a detailed three-dimensional approach (trustworthiness, exploitability, representativeness); the authors also defined the main principles and postulates to guide the operationalization of the conceptual dimensions. Second, the operationalization proposed metrics expressed in terms of variables, supported by a quantitative method for their measurement and scoring. Third, the authors shared this conceptual framework, with its dimensions and operationalized variables (metrics), with experienced professionals in order to validate it. The experts’ feedback gave the authors an indication of the relevance and the feasibility of these metrics, two aspects that may demonstrate the acceptability of such a method in real-life archival practice. In parallel, Axis 2 proposes functionalities covering not only macro analysis of data but also the algorithmic methods that enable the computation of digital archival and data mining metrics. On this basis, three use cases were proposed to illustrate plausible scenarios for the application of such a solution.

Findings

The main results demonstrate the feasibility of measuring the value of data and records with a reproducible method. More specifically, for Axis 1, the authors applied the metrics in a flexible and modular way and also defined the main principles needed to enable a computational scoring method. The results obtained through the experts’ consultation on the relevance of 42 metrics indicate an acceptance rate above 80%. In addition, the results show that 60% of all metrics can be automated. Regarding Axis 2, 33 functionalities were developed and proposed under six main types: macro analysis, microanalysis, statistics, retrieval, administration and, finally, decision modeling and machine learning. The relevance of the metrics and functionalities rests on the theoretical validity and computational character of their method. These results are largely satisfactory and promising.

Originality/value

This study offers a valuable aid to improve the validity and performance of archival appraisal processes and decision-making. Transferability and applicability of these archival and data mining metrics could be considered for other types of data. An adaptation of this method and its metrics could be tested on research data, medical data or banking data.

Details

Records Management Journal, vol. 30 no. 2
Type: Research Article
ISSN: 0956-5698

Open Access
Article
Publication date: 8 December 2020

Matjaž Kragelj and Mirjana Kljajić Borštnar

Abstract

Purpose

The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.

Design/methodology/approach

The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.
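
The train-then-classify step can be sketched with a toy bag-of-words classifier in the spirit of the study (a smoothed log-likelihood model). The labels, example texts and UDC-like codes below are invented for illustration and are not the authors' actual model:

```python
# Toy text classifier: train on labelled texts, then assign a UDC-like
# code to a new text. Labels and corpora are invented for illustration.
from collections import Counter
import math

def train(labelled_docs):
    """Count token frequencies per class."""
    model = {}
    for text, label in labelled_docs:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def classify(model, text, alpha=1.0):
    """Pick the class with the highest smoothed log-likelihood."""
    tokens = text.lower().split()
    vocab = {t for counts in model.values() for t in counts}
    best, best_score = None, -math.inf
    for label, counts in model.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[t] + alpha) / (total + alpha * len(vocab)))
            for t in tokens
        )
        if score > best_score:
            best, best_score = label, score
    return best

model = train([
    ("laws of motion and energy", "53 Physics"),
    ("medieval kingdoms and chronicles", "94 History"),
])
print(classify(model, "energy and motion in mechanics"))  # 53 Physics
```

The study's actual model was trained on 70,000 fully processed scholarly texts; the sketch only shows the general shape of supervised text classification.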

Findings

Results suggest that machine-learning models can correctly assign the UDC at some level of its hierarchy for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.

Research limitations/implications

The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.

Practical implications

The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.

Social implications

The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.

Originality/value

These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.

Details

Journal of Documentation, vol. 77 no. 3
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 30 November 2021

Federico Barravecchia, Luca Mastrogiacomo and Fiorenzo Franceschini

Abstract

Purpose

Digital voice-of-customer (digital VoC) analysis is gaining much attention in the field of quality management. Digital VoC can be a great source of knowledge about customer needs, habits and expectations. To this end, the most popular approach is based on the application of text mining algorithms named topic modelling. These algorithms can identify latent topics discussed within digital VoC and categorise each source (e.g. each review) based on its content. This paper aims to propose a structured procedure for validating the results produced by topic modelling algorithms.

Design/methodology/approach

The proposed procedure compares, on random samples, the results produced by topic modelling algorithms with those generated by human evaluators. The use of specific metrics allows a comparison between the two approaches and provides a preliminary empirical validation.
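
One common way to quantify agreement between algorithmic and human categorisations is Cohen's kappa, shown here as a plausible illustration rather than the authors' exact metric; the topic labels are invented:

```python
# Cohen's kappa: chance-corrected agreement between two raters, e.g. a
# topic-modelling algorithm and a human evaluator labelling the same reviews.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n)
        for c in freq_a.keys() | freq_b.keys()
    )
    return (observed - expected) / (1 - expected)

algo = ["price", "price", "comfort", "comfort"]
human = ["price", "price", "comfort", "price"]
print(cohens_kappa(algo, human))  # 0.5
```

Kappa corrects raw per-review agreement for the agreement expected by chance, which matters when a few topics dominate the sample.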

Findings

The proposed procedure can assist users of topic modelling algorithms in validating the obtained results. An application case study related to some car-sharing services supports the description.

Originality/value

Despite the vast success of topic modelling-based approaches, metrics and procedures to validate the obtained results are still lacking. This paper provides a first practical and structured validation procedure specifically employed for quality-related applications.

Details

International Journal of Quality & Reliability Management, vol. 39 no. 6
Type: Research Article
ISSN: 0265-671X

Open Access
Article
Publication date: 23 May 2023

Kimmo Kettunen, Heikki Keskustalo, Sanna Kumpulainen, Tuula Pääkkönen and Juha Rautiainen

Abstract

Purpose

This study aims to identify user perception of different qualities of optical character recognition (OCR) in texts. The purpose of this paper is to study the effect of different quality OCR on users' subjective perception through an interactive information retrieval task with a collection of one digitized historical Finnish newspaper.

Design/methodology/approach

This study is based on the simulated work task model used in interactive information retrieval. Thirty-two users searched an article collection of the Finnish newspaper Uusi Suometar (1869–1918), which consists of ca. 1.45 million auto-segmented articles. The article search database had two versions of each article with different quality OCR. Each user performed six pre-formulated and six self-formulated short queries and subjectively evaluated the top 10 results using a graded relevance scale of 0–3. Users were not informed about the OCR quality differences of the otherwise identical articles.
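
Graded 0–3 judgments over a ranked top-10 list are typically aggregated with measures such as discounted cumulative gain (DCG). This is a generic sketch, not necessarily the measure used in the study:

```python
# Discounted cumulative gain over graded relevance judgments (0-3):
# relevant results ranked earlier contribute more to the score.
import math

def dcg(grades):
    """Sum each grade discounted by the log of its 1-based rank."""
    return sum(
        g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1)
    )

# The same grades in a better vs worse ranking order.
print(dcg([3, 2, 0]))  # ≈ 4.26
print(dcg([0, 2, 3]))  # ≈ 2.76
```

Comparing such scores between the two OCR versions of each article list would quantify how much text quality shifts users' relevance assessments.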

Findings

The main result of the study is that improved OCR quality affects subjective user perception of historical newspaper articles positively: higher relevance scores are given to better-quality texts.

Originality/value

To the best of the authors’ knowledge, this simulated interactive work task experiment is the first one showing empirically that users' subjective relevance assessments are affected by a change in the quality of an optically read text.

Details

Journal of Documentation, vol. 79 no. 7
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 18 April 2024

Joseph Nockels, Paul Gooding and Melissa Terras

Abstract

Purpose

This paper focuses on image-to-text manuscript processing through Handwritten Text Recognition (HTR), a Machine Learning (ML) approach enabled by Artificial Intelligence (AI). With HTR now achieving high levels of accuracy, we consider its potential impact on our near-future information environment and knowledge of the past.

Design/methodology/approach

In undertaking a more constructivist analysis, we identified gaps in the current literature through a Grounded Theory Method (GTM). This guided an iterative process of concept mapping through writing sprints in workshop settings. We identified, explored and confirmed themes through group discussion and a further interrogation of relevant literature, until reaching saturation.

Findings

Catalogued as part of our GTM, 120 published texts underpin this paper. We found that HTR facilitates accurate transcription and dataset cleaning, while facilitating access to a variety of historical material. HTR contributes to a virtuous cycle of dataset production and can inform the development of online cataloguing. However, current limitations include dependency on digitisation pipelines, potential archival history omission and entrenchment of bias. We also identify near-future HTR considerations: encouraging open access; integrating advanced AI processes and metadata extraction; legal and moral issues surrounding copyright and data ethics; crediting individuals’ transcription contributions; and HTR’s environmental costs.

Originality/value

Our research produces a set of best practice recommendations for researchers, data providers and memory institutions, surrounding HTR use. This forms an initial, though not comprehensive, blueprint for directing future HTR research. In pursuing this, the narrative that HTR’s speed and efficiency will simply transform scholarship in archives is deconstructed.

Open Access
Article
Publication date: 5 December 2023

Manuel J. Sánchez-Franco and Sierra Rey-Tienda

Abstract

Purpose

This research proposes to organise and distil the massive amount of online guest review data, making it easier to understand. Using data mining, machine learning techniques and visual approaches, researchers and managers can extract valuable insights (on guests' preferences) and convert them into strategic thinking based on exploration and predictive analysis. Consequently, this research aims to assist hotel managers in making informed decisions, thus improving the overall guest experience and increasing competitiveness.

Design/methodology/approach

This research employs natural language processing techniques, data visualisation proposals and machine learning methodologies to analyse unstructured guest service experience content. In particular, this research (1) applies data mining to evaluate the role and significance of critical terms and semantic structures in hotel assessments; (2) identifies salient tokens to depict guests' narratives based on term frequency and the information quantity they convey; and (3) tackles the challenge of managing extensive document repositories through automated identification of latent topics in reviews by using machine learning methods for semantic grouping and pattern visualisation.
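
The term-frequency and information-quantity idea in point (2) resembles classic TF-IDF weighting. A stdlib-only sketch on invented reviews (the reviews and field contents are illustrative, not the study's data):

```python
# Salient-token extraction via TF-IDF: a token is salient in a review when
# it is frequent there but rare across the whole review collection.
import math
from collections import Counter

def tfidf_top_token(docs, i):
    """Return the token of docs[i] with the highest tf-idf weight."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: in how many reviews each token appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    tf = Counter(tokenized[i])
    return max(tf, key=lambda t: tf[t] * math.log(n / df[t]))

reviews = [
    "great location great location clean room",
    "noisy room poor breakfast",
    "great breakfast friendly staff",
]
print(tfidf_top_token(reviews, 0))  # location
```

Note that "great", despite appearing twice in the first review, is down-weighted because it also occurs elsewhere in the collection, while "location" is distinctive.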

Findings

This study’s findings (1) aim to identify critical features and topics that guests highlight during their hotel stays, (2) visually explore the relationships between these features and differences among diverse types of travellers through online hotel reviews and (3) determine predictive power. Their implications are crucial for the hospitality domain, as they provide real-time insights into guests' perceptions and business performance and are essential for making informed decisions and staying competitive.

Originality/value

This research seeks to minimise the cognitive processing costs of the enormous amount of content published by users through a better organisation of hotel service reviews and their visualisation. Likewise, this research aims to propose a methodology and method available to tourism organisations to obtain truly usable knowledge in the design of the hotel offer and its value propositions.

Details

Management Decision, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0025-1747

Open Access
Article
Publication date: 13 June 2023

Mikael Laakso

Abstract

Purpose

Science policy and practice for open access (OA) books is a rapidly evolving area in the scholarly domain. However, there is much that remains unknown, including how many OA books there are and to what degree they are included in preservation coverage. The purpose of this study is to contribute towards filling this knowledge gap in order to advance both research and practice in the domain of OA books.

Design/methodology/approach

This study utilized open bibliometric data sources to aggregate a harmonized dataset of metadata records for OA books (data sources: the Directory of Open Access Books, OpenAIRE, OpenAlex, Scielo Books, The Lens, and WorldCat). This dataset was then cross-matched based on unique identifiers and book titles to openly available content listings of trusted preservation services (data sources: Cariniana Network, CLOCKSS, Global LOCKSS Network, and Portico). The web domains of the OA books were determined by querying the web addresses or digital object identifiers provided in the metadata of the bibliometric database entries.
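
The cross-matching step (identifier first, title fallback) can be sketched as follows. Field names, the dummy DOI and the records are invented for illustration and do not reflect the study's actual data model:

```python
# Sketch of cross-matching OA book records against a preservation service's
# content listing: match on a unique identifier first, then fall back to
# a case-insensitive title comparison. All records here are invented.
def cross_match(books, preserved):
    preserved_ids = {p["doi"] for p in preserved if p.get("doi")}
    preserved_titles = {p["title"].casefold() for p in preserved}
    matched = []
    for book in books:
        if book.get("doi") in preserved_ids:
            matched.append((book["title"], "id"))
        elif book["title"].casefold() in preserved_titles:
            matched.append((book["title"], "title"))
    return matched

books = [
    {"doi": "10.1234/abc", "title": "Open Access Monograph"},
    {"doi": None, "title": "Untracked Essays"},
    {"doi": None, "title": "Preserved Without DOI"},
]
preserved = [
    {"doi": "10.1234/abc", "title": "Open Access Monograph"},
    {"doi": None, "title": "Preserved Without DOI"},
]
print(cross_match(books, preserved))
# [('Open Access Monograph', 'id'), ('Preserved Without DOI', 'title')]
```

The title fallback is exactly where the study's reported data-quality issues bite: inconsistent identifiers force fuzzy or string-based matching, which is less reliable.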

Findings

In total, 396,995 unique records were identified from the OA book bibliometric sources, of which 19% were found to be included in at least one of the preservation services. The results suggest reason for concern for the long tail of OA books distributed across thousands of different web domains, as these include volatile cloud storage or, in some cases, no longer contain the files at all.

Research limitations/implications

Data quality issues, varying definitions of OA across services and inconsistent implementation of unique identifiers were discovered as key challenges. The study includes recommendations for publishers, libraries, data providers and preservation services for improving monitoring and practices for OA book preservation.

Originality/value

This study provides methodological and empirical findings for advancing the practices of OA book publishing, preservation and research.

Details

Journal of Documentation, vol. 79 no. 7
Type: Research Article
ISSN: 0022-0418
