Search results

1 – 10 of 24
Open Access
Article
Publication date: 21 June 2021

Bufei Xing, Haonan Yin, Zhijun Yan and Jiachen Wang

Abstract

Purpose

The purpose of this paper is to propose a new approach to retrieve similar questions in online health communities to improve the efficiency of health information retrieval and sharing.

Design/methodology/approach

This paper proposes a hybrid approach that combines domain knowledge similarity and topic similarity to retrieve similar questions in online health communities. The domain knowledge similarity evaluates the domain distance between questions, while the topic similarity measures their relationship based on the extracted latent topics.
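
A minimal sketch of such a hybrid score (an illustration, not the authors' implementation): domain similarity is approximated by Jaccard overlap of the medical entities named in two questions, topic similarity by cosine similarity over latent topic distributions (e.g. from LDA), and the interpolation weight `alpha` is an assumption.

```python
# Hypothetical sketch of a hybrid question-retrieval score: a weighted
# combination of domain-knowledge similarity (Jaccard overlap of medical
# entities) and topic similarity (cosine over topic distributions).
# The weight `alpha` and the entity/topic inputs are illustrative only.
import math

def domain_similarity(entities_a: set, entities_b: set) -> float:
    """Jaccard overlap of the named medical entities in two questions."""
    if not entities_a or not entities_b:
        return 0.0
    return len(entities_a & entities_b) / len(entities_a | entities_b)

def topic_similarity(topics_a: list, topics_b: list) -> float:
    """Cosine similarity of latent topic distributions (e.g. from LDA)."""
    dot = sum(x * y for x, y in zip(topics_a, topics_b))
    norm = (math.sqrt(sum(x * x for x in topics_a))
            * math.sqrt(sum(y * y for y in topics_b)))
    return dot / norm if norm else 0.0

def hybrid_score(q_ents, q_topics, c_ents, c_topics, alpha=0.5):
    """Interpolate the two similarities; alpha balances their contribution."""
    return (alpha * domain_similarity(q_ents, c_ents)
            + (1 - alpha) * topic_similarity(q_topics, c_topics))

# Toy usage: rank two candidate questions against a query.
query = ({"diabetes", "insulin"}, [0.7, 0.2, 0.1])
candidates = {
    "Q1": ({"insulin", "dosage"}, [0.6, 0.3, 0.1]),
    "Q2": ({"migraine"}, [0.1, 0.2, 0.7]),
}
ranked = sorted(candidates, key=lambda k: hybrid_score(*query, *candidates[k]),
                reverse=True)
print(ranked)  # Q1 should rank above Q2
```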

Findings

The experimental results show that the proposed method outperforms the baseline methods.

Originality/value

This method overcomes the problem of word mismatch and takes into account the named entities included in questions, which most existing studies do not.

Details

International Journal of Crowd Science, vol. 5 no. 2
Type: Research Article
ISSN: 2398-7294

Open Access
Article
Publication date: 28 July 2020

Prabhat Pokharel, Roshan Pokhrel and Basanta Joshi

Abstract

Analysis of log messages is very important for the identification of suspicious system and network activity. This analysis requires the correct extraction of variable entities, which are extracted by comparing the log messages against log patterns. Each of these log patterns can be represented in the form of a log signature. In this paper, we present a hybrid approach for log signature extraction. The approach consists of two modules: the first identifies log patterns by generating log clusters, and the second uses Named Entity Recognition (NER) to extract signatures from the extracted log clusters. Experiments were performed on event logs from the Windows operating system, Exchange and Unix, and the results were validated by comparing the signatures and the variable entities against the standard log documentation. The outcome of the experiments was that the extracted signatures were ready to be used with a high degree of accuracy.
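
The two-module idea can be illustrated with a deliberately simplified sketch (not the paper's algorithm): messages that share a token layout are grouped into clusters, and tokens that vary within a cluster are marked as variable-entity slots, yielding one signature per cluster. A real system would use similarity-based clustering and NER to classify the slots.

```python
# Simplified illustration of the two modules: (1) cluster log messages
# that share a token layout, (2) derive a signature per cluster by
# marking tokens that vary as variable-entity slots ("<*>").
from collections import defaultdict

def cluster_key(message: str) -> tuple:
    # Crude clustering: token count plus the first token; real systems
    # use similarity-based log clustering.
    tokens = message.split()
    return (len(tokens), tokens[0])

def extract_signature(cluster: list) -> str:
    columns = zip(*(m.split() for m in cluster))
    # A column with a single distinct value is constant text; otherwise
    # it is a variable-entity slot.
    return " ".join(col[0] if len(set(col)) == 1 else "<*>" for col in columns)

logs = [
    "sshd accepted password for alice from 10.0.0.1",
    "sshd accepted password for bob from 10.0.0.9",
    "kernel out of memory killed process 4312",
]
clusters = defaultdict(list)
for line in logs:
    clusters[cluster_key(line)].append(line)
for members in clusters.values():
    print(extract_signature(members))
# -> "sshd accepted password for <*> from <*>"
# -> "kernel out of memory killed process 4312"
```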

Details

Applied Computing and Informatics, vol. 19 no. 1/2
Type: Research Article
ISSN: 2634-1964

Open Access
Article
Publication date: 31 July 2023

Sara Lafia, David A. Bleckley and J. Trent Alexander

Abstract

Purpose

Many libraries and archives maintain collections of research documents, such as administrative records, in paper-based formats that limit access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes the authors' approach using digital scanning, optical character recognition (OCR) and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen's Readjustment Act of 1944, also known as the G.I. Bill.

Design/methodology/approach

The authors used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill Mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers and the name and location of the bank handling the loan. The authors extracted structured information from these scanned historical records in order to create a tabular data file and link them to other authoritative individual-level data sources.

Findings

The authors compared the flexible character accuracy of five OCR methods. The authors then compared the character error rate (CER) of three text extraction approaches (regular expressions, DIA and named entity recognition (NER)). The authors obtained the highest-quality structured text output using DIA with the Layout Parser toolkit, post-processed with regular expressions. Through this project, the authors demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.
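
As a hedged illustration of the OCR-plus-regular-expressions stage, the sketch below uses pytesseract in place of the authors' OCR and Layout Parser pipeline; the field names, patterns and file path are hypothetical stand-ins for the mortgage-record fields.

```python
# Hedged sketch: OCR one scanned record, then pull structured fields
# via regular expressions. The field patterns and record layout are
# hypothetical, not the paper's exact pipeline.
import re
import pytesseract
from PIL import Image

FIELD_PATTERNS = {
    "mortgagor": re.compile(r"Mortgagor[:\s]+([A-Z][A-Za-z .,'-]+)"),
    "amount": re.compile(r"\$\s?([\d,]+(?:\.\d{2})?)"),
    "id_number": re.compile(r"\b(\d{2}-\d{6})\b"),  # assumed ID format
}

def extract_record(image_path: str) -> dict:
    """OCR one scanned record, then extract structured fields via regex."""
    text = pytesseract.image_to_string(Image.open(image_path))
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record

# print(extract_record("scans/record_0001.png"))  # hypothetical path
```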

Originality/value

The authors' workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR and DIA processes, the authors created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans and the institutions that implemented them.

Details

Journal of Documentation, vol. 79 no. 7
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 20 July 2020

Mohd Faizan, Raees Ahmad Khan and Alka Agrawal

Abstract

Cryptomarkets on the dark web have emerged as a hub for the sale of illicit drugs. They have made it easier for customers to access illicit drugs online while ensuring their anonymity. The easy availability of potentially harmful drugs has had a significant impact on public health. Consequently, law enforcement agencies put a lot of effort and resources into shutting down online markets on the dark web. Considerable research has also been conducted to understand the behavior of customers and vendors involved in cryptomarkets, which may help law enforcement agencies. In this research, we present a ranking methodology to identify and rank the top markets dealing in harmful illicit drugs. Using named entity recognition, a harm score is calculated for each drug market to indicate its degree of threat, and the markets are then ranked accordingly. The top-ranked markets are the ones selling the most harmful drugs. The rankings thus obtained can help law enforcement agencies locate and further monitor specific markets selling harmful illicit drugs.
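
The harm-score-and-rank idea can be sketched as follows; the drug list, harm weights and mention counts are invented for illustration, and in the paper the mention counts would come from named entity recognition over market listings.

```python
# Illustrative sketch: weight drug mentions (stand-in for NER output)
# by per-drug harm and rank markets by total score. The weights and
# counts below are invented, not taken from the paper.
HARM_WEIGHTS = {"fentanyl": 10, "heroin": 9, "cocaine": 7, "cannabis": 2}

def harm_score(drug_mentions: dict) -> float:
    """Sum mention counts weighted by per-drug harm."""
    return sum(HARM_WEIGHTS.get(drug, 0) * count
               for drug, count in drug_mentions.items())

# Mention counts per market would come from NER over listing texts.
markets = {
    "market_a": {"fentanyl": 12, "cannabis": 40},
    "market_b": {"cocaine": 25, "heroin": 3},
    "market_c": {"cannabis": 90},
}
ranking = sorted(markets, key=lambda m: harm_score(markets[m]), reverse=True)
for m in ranking:
    print(m, harm_score(markets[m]))  # highest score = most harmful market
```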

Details

Applied Computing and Informatics, vol. 18 no. 3/4
Type: Research Article
ISSN: 2634-1964

Open Access
Article
Publication date: 6 July 2020

Basma Makhlouf Shabou, Julien Tièche, Julien Knafou and Arnaud Gaudinat

Abstract

Purpose

This paper aims to describe interdisciplinary and innovative research conducted in Switzerland, at the Geneva School of Business Administration HES-SO and supported by the State Archives of Neuchâtel (Office des archives de l'État de Neuchâtel, OAEN). The problem addressed is one of the most classical ones: how to extract and discriminate relevant data in a huge amount of diversified and complex data record formats and contents. The goal of this study is to provide a framework and a proof of concept for software that helps take defensible decisions on the retention and disposal of records and data proposed to the OAEN. For this purpose, the authors designed two axes: the archival axis, to propose archival metrics for the appraisal of structured and unstructured data, and the data mining axis, to propose algorithmic methods as complementary and/or additional metrics for the appraisal process.

Design/methodology/approach

Based on the two axes, this exploratory study designs and tests the feasibility of archival metrics that are paired with data mining metrics, to advance the digital appraisal process, as far as possible, in a systematic or even automatic way. Under Axis 1, the authors took three steps. First, they designed a conceptual framework for records and data appraisal with a detailed three-dimensional approach (trustworthiness, exploitability, representativeness) and defined the main principles and postulates guiding the operationalization of the conceptual dimensions. Second, the operationalization proposed metrics expressed in terms of variables, supported by a quantitative method for their measurement and scoring. Third, the authors shared this conceptual framework, with its dimensions and operationalized variables (metrics), with experienced professionals in order to validate it. The experts' feedback gave the authors an indication of the relevance and the feasibility of these metrics, two aspects that may demonstrate the acceptability of such a method in real-life archival practice. In parallel, Axis 2 proposes functionalities covering not only macro analysis of data but also the algorithmic methods enabling the computation of the digital archival and data mining metrics. On this basis, three use cases were proposed to illustrate plausible scenarios for the application of such a solution.
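
A hypothetical sketch of what a computational scoring step over such operationalized variables might look like; the variable names and dimension weights below are illustrative assumptions, not the study's validated metrics.

```python
# Hypothetical appraisal scoring: each dimension (trustworthiness,
# exploitability, representativeness) aggregates normalized metric
# values, and dimensions combine into one weighted score. Names and
# weights are invented for illustration.
DIMENSION_WEIGHTS = {"trustworthiness": 0.4, "exploitability": 0.35,
                     "representativeness": 0.25}

def dimension_score(variables: dict) -> float:
    """Average the normalized (0-1) metric values within one dimension."""
    return sum(variables.values()) / len(variables)

def appraisal_score(record_metrics: dict) -> float:
    """Weighted combination across the three conceptual dimensions."""
    return sum(DIMENSION_WEIGHTS[dim] * dimension_score(vars_)
               for dim, vars_ in record_metrics.items())

record = {
    "trustworthiness": {"source_known": 1.0, "format_valid": 0.8},
    "exploitability": {"machine_readable": 1.0, "metadata_complete": 0.5},
    "representativeness": {"covers_period": 0.7},
}
print(round(appraisal_score(record), 3))  # -> 0.798
```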

Findings

The main results demonstrate the feasibility of measuring the value of data and records with a reproducible method. More specifically, for Axis 1, the authors applied the metrics in a flexible and modular way and also defined the main principles needed to enable a computational scoring method. The results obtained through the experts' consultation on the relevance of the 42 metrics indicate an acceptance rate above 80%. In addition, the results show that 60% of all metrics can be automated. Regarding Axis 2, 33 functionalities were developed and proposed under six main types: macro analysis, microanalysis, statistics, retrieval, administration and, finally, decision modeling and machine learning. The relevance of the metrics and functionalities rests on the theoretical validity and computational character of their method. These results are largely satisfactory and promising.

Originality/value

This study offers a valuable aid to improve the validity and performance of archival appraisal processes and decision-making. Transferability and applicability of these archival and data mining metrics could be considered for other types of data. An adaptation of this method and its metrics could be tested on research data, medical data or banking data.

Details

Records Management Journal, vol. 30 no. 2
Type: Research Article
ISSN: 0956-5698

Open Access
Article
Publication date: 23 November 2023

Reema Khaled AlRowais and Duaa Alsaeed

Abstract

Purpose

Automatically extracting stance information from natural language texts is a significant research problem with various applications, particularly after the recent explosion of data on the internet via platforms like social media sites. A stance detection system helps determine whether the author agrees with, is against or holds a neutral opinion toward a given target. Most research on stance detection focuses on the English language, while little has been conducted on Arabic.

Design/methodology/approach

This paper addresses stance detection on Arabic tweets by building and comparing stance detection models based on four transformers, namely AraELECTRA, MARBERT, AraBERT and Qarib. Using different weights for these transformers, the authors performed extensive experiments fine-tuning them for the task of stance detection on Arabic tweets.
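
A hedged sketch of this kind of fine-tuning with the Hugging Face transformers library; the checkpoint identifier, the toy examples and the hyperparameters are assumptions, not the authors' exact setup.

```python
# Hedged sketch: fine-tune one transformer for three-way stance
# classification (agree / against / neutral). The checkpoint id,
# toy data and hyperparameters are assumptions.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "aubmindlab/bert-base-arabertv02"  # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT,
                                                           num_labels=3)

texts = ["...tweet text...", "...tweet text..."]  # Arabic tweets (placeholder)
labels = [0, 2]                                   # 0=agree, 1=against, 2=neutral

class StanceDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="stance_out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=StanceDataset(texts, labels),
)
trainer.train()
```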

Findings

The results showed that the AraBERT model performed best, with a 70% F1 score, followed by the Qarib model with a 68% F1 score.

Research limitations/implications

A limitation of this study is the imbalanced dataset and the limited availability of annotated stance detection datasets in Arabic.

Originality/value

This study provides a comprehensive overview of the current resources for stance detection in the literature, including the datasets and machine learning methods used. The authors also examined the models in order to analyze and interpret the findings and to recommend the best-performing models for the stance detection task.

Details

Arab Gulf Journal of Scientific Research, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1985-9899

Open Access
Article
Publication date: 15 September 2021

Elina Late and Sanna Kumpulainen

Abstract

Purpose

The paper examines academic historians' information interactions with material from digital historical-newspaper collections as the research process unfolds.

Design/methodology/approach

The study employed qualitative analysis from in-depth interviews with Finnish history scholars who use digitised historical newspapers as primary sources for their research. A model for task-based information interaction guided the collection and analysis of data.

Findings

The study revealed numerous information interactions within activities related to task planning, the search process, selecting and working with the items, and synthesis and reporting. The information interactions differ across these activities, which calls for system support mechanisms specific to each activity type. Various activities feature information searching, which is an essential research method for those using digital collections in the compilation and analysis of data. Furthermore, the application of quantitative methods and multidisciplinary collaboration may be shifting the culture of history research toward convergence with the research culture of the natural sciences.

Originality/value

For sustainable digital humanities infrastructure and digital collections, it is of great importance that system designers understand how and why the collections are accessed and how they are used in real-world contexts. The study enriches understanding of the collections' utilisation and advances a theoretical framework for explicating task-based information interaction.

Details

Journal of Documentation, vol. 78 no. 7
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 6 March 2017

Zhuoxuan Jiang, Chunyan Miao and Xiaoming Li

Abstract

Purpose

Recent years have witnessed the rapid development of massive open online courses (MOOCs). With more and more courses being produced by instructors and attended by learners all over the world, unprecedented massive educational resources are aggregated. The educational resources include videos, subtitles, lecture notes, quizzes, etc., on the teaching side, and forum contents, Wiki, logs of learning behavior, logs of homework, etc., on the learning side. However, the data are both unstructured and diverse. To facilitate knowledge management and mining on MOOCs, extracting keywords from the resources is important. This paper aims to adapt the state-of-the-art techniques to MOOC settings and evaluate their effectiveness on real data. In terms of practice, this paper also tries to answer, for the first time, the questions of to what extent MOOC resources can support keyword extraction models and how much human effort is required to make the models work well.

Design/methodology/approach

Based on which side generates the data, i.e. instructors or learners, the data are classified into teaching resources and learning resources, respectively. The approach used on teaching resources is based on supervised machine learning models with labels, while the approach used on learning resources is based on a graph model that requires no labels.
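
The learner-side, label-free approach can be illustrated with a TextRank-style sketch: build a co-occurrence graph over forum text and rank words with PageRank. The window size, stopword list and toy posts are assumptions; the paper's exact graph model may differ.

```python
# TextRank-style sketch (illustrative): words co-occurring within a
# sliding window become graph edges, and PageRank scores rank keyword
# candidates. Real forum posts would be cleaned and tokenized properly.
import networkx as nx

STOPWORDS = {"the", "is", "a", "i", "for", "which", "should", "when", "use"}

def extract_keywords(posts, window=2, top_k=5):
    graph = nx.Graph()
    for post in posts:
        words = [w for w in post.lower().split() if w not in STOPWORDS]
        # Connect each word to its neighbours within the window.
        for i, w in enumerate(words):
            for u in words[i + 1:i + 1 + window]:
                if u != w:
                    graph.add_edge(w, u)
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

posts = [
    "gradient descent converges when the learning rate is small",
    "which learning rate should i use for stochastic gradient descent",
]
print(extract_keywords(posts))  # e.g. ['gradient', 'descent', 'learning', ...]
```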

Findings

From the teaching resources, the authors' methods can accurately extract keywords with only 10 per cent of the data labeled. The authors also find a characteristic of the data: resources of various forms, e.g. subtitles and PPTs, should be considered separately because models fit them differently. From the learning resources, the keywords extracted from MOOC forums are not as domain-specific as those extracted from teaching resources, but they reflect the topics that are actively discussed in the forums, so instructors can draw feedback from them. The authors implement two applications with the extracted keywords: generating a concept map and generating a learning path. The visual demos show that both have the potential to improve learning efficiency when integrated into a real MOOC platform.

Research limitations/implications

Conducting keyword extraction on MOOC resources is quite difficult because teaching resources are hard to obtain due to copyright. Obtaining labeled data is also demanding because expertise in the corresponding domain is usually required.

Practical implications

The experimental results support the conclusion that MOOC resources are good enough for building keyword extraction models and that an acceptable balance between human effort and model accuracy can be achieved.

Originality/value

This paper presents a pioneer study on keyword extraction on MOOC resources and obtains some new findings.

Details

International Journal of Crowd Science, vol. 1 no. 1
Type: Research Article
ISSN: 2398-7294

Abstract

Purpose

An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship and evidences this turning point in the advanced use of digitised heritage content. The paper aims to discuss these issues.

Design/methodology/approach

This paper adopts a case study approach, using the development and delivery of the only openly available HTR platform for manuscript material.

Findings

Transkribus has demonstrated that HTR is now a usable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.

Research limitations/implications

The paper presents results from projects; further user studies could be undertaken involving interviews, surveys, etc.

Practical implications

Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing, and it represents the current state of the art in this field.

Social implications

The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.

Originality/value

This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.

Open Access
Article
Publication date: 5 March 2021

Xuan Ji, Jiachen Wang and Zhijun Yan

Abstract

Purpose

Stock price prediction is a hot topic, and traditional prediction methods are usually based on statistical and econometric models. However, these models have difficulty dealing with nonstationary time series data. With the rapid development of the internet and the increasing popularity of social media, online news and comments often reflect investors' emotions and attitudes toward stocks and contain a lot of important information for predicting stock prices. This paper aims to develop a stock price prediction method that takes full advantage of social media data.

Design/methodology/approach

This study proposes a new prediction method based on deep learning technology, which integrates traditional stock financial index variables and social media text features as inputs to the prediction model. The study uses Doc2Vec to build long-text feature vectors from social media and then reduces their dimensionality with a stacked auto-encoder to balance the dimensions between the text feature variables and the stock financial index variables. Meanwhile, based on the wavelet transform, the time series of stock prices is decomposed to eliminate the random noise caused by stock market fluctuation. Finally, the study uses a long short-term memory model to predict the stock price.
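
As an illustration of the wavelet-denoising step, here is a minimal sketch with PyWavelets; the wavelet family, decomposition level and threshold rule are assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of wavelet denoising for a price series: decompose,
# soft-threshold the detail coefficients to suppress market noise,
# and reconstruct. Wavelet family, level and threshold are assumptions.
import numpy as np
import pywt

def wavelet_denoise(prices, wavelet="db4", level=2):
    coeffs = pywt.wavedec(prices, wavelet, level=level)
    # Universal threshold estimated from the finest detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(prices)))
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(prices)]

prices = np.cumsum(np.random.randn(256)) + 100  # synthetic price path
smooth = wavelet_denoise(prices)
# `smooth`, together with Doc2Vec text features (dimension-reduced by a
# stacked auto-encoder), would then feed an LSTM predictor.
```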

Findings

The experimental results show that the method performs better than all three benchmark models on all evaluation indicators and can effectively predict the stock price.

Originality/value

This study proposes a new deep learning-based stock price prediction model that incorporates traditional financial features and text features derived from social media.

Details

International Journal of Crowd Science, vol. 5 no. 1
Type: Research Article
ISSN: 2398-7294
