Search results

1 – 10 of over 17000
Open Access
Article
Publication date: 14 August 2017

Xiu Susie Fang, Quan Z. Sheng, Xianzhi Wang, Anne H.H. Ngu and Yihong Zhang

Abstract

Purpose

This paper aims to propose a system for generating actionable knowledge from Big Data and use this system to construct a comprehensive knowledge base (KB), called GrandBase.

Design/methodology/approach

In particular, this study extracts new predicates from four types of data sources, namely, Web texts, Document Object Model (DOM) trees, existing KBs and query stream to augment the ontology of the existing KB (i.e. Freebase). In addition, a graph-based approach to conduct better truth discovery for multi-valued predicates is also proposed.
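
As an illustration of the truth discovery phase, the sketch below implements a generic iterative source-trust/claim-confidence scheme over a source-claim graph that admits several true values per predicate. It is a simplified voting-style baseline over assumed toy data, not GrandBase's actual graph-based algorithm.

from collections import defaultdict

# claims[(entity, predicate, value)] = set of sources asserting that value;
# a multi-valued predicate ("child") may legitimately have several true values.
# Entities and sources here are illustrative only.
claims = {
    ("Ang_Lee", "child", "Haan_Lee"): {"src_A", "src_B"},
    ("Ang_Lee", "child", "Mason_Lee"): {"src_B"},
    ("Ang_Lee", "child", "Bogus_Name"): {"src_C"},
}

trust = defaultdict(lambda: 0.5)            # source trustworthiness, start uniform
confidence = {c: 0.5 for c in claims}       # claim confidence

for _ in range(20):                         # iterate until roughly stable
    # claim confidence: probability that at least one supporting source is right
    for c, srcs in claims.items():
        miss = 1.0
        for s in srcs:
            miss *= 1.0 - trust[s]
        confidence[c] = 1.0 - miss
    # source trust: average confidence of the claims the source supports
    supported = defaultdict(list)
    for c, srcs in claims.items():
        for s in srcs:
            supported[s].append(confidence[c])
    for s, vals in supported.items():
        trust[s] = sum(vals) / len(vals)

# keep every value whose confidence clears a threshold (multi-valued output)
accepted = [c for c, conf in confidence.items() if conf >= 0.6]
print(accepted)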

Findings

Empirical studies demonstrate the effectiveness of the approaches presented in this study and the potential of GrandBase. Future research directions regarding GrandBase construction and extension are also discussed.

Originality/value

To revolutionize our modern society by using the wisdom of Big Data, numerous KBs have been constructed to feed the massive knowledge-driven applications with Resource Description Framework triples. The important challenges for KB construction include extracting information from large-scale, possibly conflicting and differently structured data sources (i.e. the knowledge extraction problem) and reconciling the conflicts that reside in the sources (i.e. the truth discovery problem). Tremendous research efforts have been devoted to both problems. However, the existing KBs are far from being comprehensive and accurate: first, existing knowledge extraction systems retrieve data from limited types of Web sources; second, existing truth discovery approaches commonly assume each predicate has only one true value. In this paper, the focus is on the problem of generating actionable knowledge from Big Data. A system is proposed, which consists of two phases, namely, knowledge extraction and truth discovery, to construct a broader KB, called GrandBase.

Details

PSU Research Review, vol. 1 no. 2
Type: Research Article
ISSN: 2399-1747

Keywords

Article
Publication date: 3 August 2021

Irvin Dongo, Yudith Cardinale, Ana Aguilera, Fabiola Martinez, Yuni Quintero, German Robayo and David Cabeza

Abstract

Purpose

This paper aims to perform an exhaustive revision of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies performs a comparative evaluation of both extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both alternatives for data extraction and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations.

Design/methodology/approach

As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the shared information. To do so, several proposals use either the Twitter API or Web scraping to extract the data to perform the analysis. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods.
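
As a rough illustration of the two extraction routes being compared, the sketch below contrasts a Twitter API request with a page-scraping call. The endpoint fields, bearer token, tweet ID and HTML selector are assumptions for illustration only; the framework evaluated in the paper differs, and real tweet pages are often rendered client-side, which makes scraping fragile.

import requests
from bs4 import BeautifulSoup

BEARER = "YOUR_BEARER_TOKEN"      # hypothetical credential
TWEET_ID = "1234567890"           # hypothetical tweet id

def via_api(tweet_id: str) -> dict:
    """Fetch tweet attributes through a Twitter API v2-style endpoint."""
    url = f"https://api.twitter.com/2/tweets/{tweet_id}"
    params = {"tweet.fields": "created_at,public_metrics,lang"}
    resp = requests.get(url, params=params,
                        headers={"Authorization": f"Bearer {BEARER}"},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()

def via_scraping(tweet_url: str) -> dict:
    """Scrape the public tweet page; the selector is illustrative and fragile."""
    html = requests.get(tweet_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    text_node = soup.find("div", attrs={"data-testid": "tweetText"})
    return {"text": text_node.get_text(" ", strip=True) if text_node else None}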

Findings

The study demonstrates the differences in accuracy and efficiency between the two extraction methods and highlights further problems in this area that must be addressed to pursue true transparency and legitimacy of information on the Web.

Originality/value

Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e. tweet). Moreover, concerning time performance, Web scraping is faster than the Twitter API and more flexible in terms of obtaining data; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.

Details

International Journal of Web Information Systems, vol. 17 no. 6
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 9 August 2021

Xintong Zhao, Jane Greenberg, Vanessa Meschke, Eric Toberer and Xiaohua Hu

Abstract

Purpose

The output of academic literature has increased significantly due to digital technology, presenting researchers with a challenge across every discipline, including materials science, as it is impossible to manually read and extract knowledge from millions of published articles. The purpose of this study is to address this challenge by exploring knowledge extraction in materials science, as applied to digital scholarship. An overriding goal is to help inform readers about the status of knowledge extraction in materials science.

Design/methodology/approach

The authors conducted a two-part analysis: a comparison of knowledge extraction methods applied to materials science scholarship across a sample of 22 articles, followed by a comparison of HIVE-4-MAT, an ontology-based knowledge extraction application, and MatScholar, a named entity recognition (NER) application. This paper covers contextual background and a review of three tiers of knowledge extraction (ontology-based, NER and relation extraction), followed by the research goals and approach.
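
As an informal illustration of the two families of tools compared (ontology-based extraction versus NER), the sketch below matches a tiny, made-up ontology vocabulary against text and, separately, runs an off-the-shelf spaCy NER model. Neither reproduces HIVE-4-MAT or MatScholar; the ontology terms and the model choice are assumptions for illustration.

import re
import spacy   # requires: python -m spacy download en_core_web_sm

# illustrative ontology labels; a real ontology would supply thousands of terms
ONTOLOGY_LABELS = {"thermoelectric material", "band gap", "ZnO", "Seebeck coefficient"}

def ontology_matches(text: str):
    """Return ontology terms found verbatim in the text (case-insensitive)."""
    hits = []
    for label in ONTOLOGY_LABELS:
        if re.search(rf"\b{re.escape(label)}\b", text, flags=re.IGNORECASE):
            hits.append(label)
    return hits

def ner_matches(text: str):
    """Generic named entities from an off-the-shelf spaCy model."""
    nlp = spacy.load("en_core_web_sm")
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

sample = "ZnO is a wide band gap material studied for thermoelectric applications."
print(ontology_matches(sample))
print(ner_matches(sample))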

Findings

The results indicate three key needs for researchers to consider in advancing knowledge extraction: the need for materials science-focused corpora; the need for researchers to define the scope of the research being pursued; and the need to understand the tradeoffs among different knowledge extraction methods. This paper also points to the future potential of materials science research with relation extraction and the increased availability of ontologies.

Originality/value

To the best of the authors’ knowledge, there are very few studies examining knowledge extraction in materials science. This work makes an important contribution to this underexplored research area.

Details

The Electronic Library, vol. 39 no. 3
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 22 October 2021

Na Pang, Li Qian, Weimin Lyu and Jin-Dong Yang

Abstract

Purpose

In computational chemistry, the chemical bond energy (pKa) is essential, but most pKa-related data are submerged in scientific papers, and only a small amount has been extracted manually by domain experts. This loss of scientific data hinders in-depth and innovative scientific data analysis. To address this problem, this study aims to use natural language processing methods to extract pKa-related scientific data from chemical papers.

Design/methodology/approach

Building on a previous BERT-CRF model that combined dictionaries and rules to resolve the problem of a large number of unknown words in the professional vocabulary, the authors propose an end-to-end BERT-CRF model whose input includes domain wordpiece tokens constructed with text mining methods. Standard high-frequency string extraction techniques are used to construct domain wordpiece tokens for specific domains, and in the subsequent deep learning work these domain features are added to the input.
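
A minimal sketch of this idea, building domain wordpiece tokens from high-frequency strings in a domain corpus and adding them to a standard BERT tokenizer, is given below. The frequency thresholds, the toy corpus and the use of the Hugging Face tokenizer are assumptions; the authors' exact construction rules are not reproduced.

from collections import Counter
from transformers import BertTokenizer

def frequent_substrings(corpus, min_len=4, max_len=10, top_k=200):
    """Count character n-grams and keep the most frequent as candidate tokens."""
    counts = Counter()
    for sentence in corpus:
        for n in range(min_len, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return [s for s, _ in counts.most_common(top_k)]

# toy domain corpus; a real pipeline would use the full chemical-paper corpus
corpus = ["the pKa of acetic acid in DMSO was measured",
          "bond dissociation energies correlate with pKa values"]
domain_tokens = frequent_substrings(corpus)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(domain_tokens)   # the model's embeddings must be resized to match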

Findings

The experiments show that the end-to-end BERT-CRF model achieves relatively good results and can be easily transferred to other domains, because it reduces the reliance on experts: automatic high-frequency wordpiece token extraction techniques are used to construct the domain wordpiece tokenization rules, and the resulting domain features are then input to the BERT model.

Originality/value

By decomposing a large number of unknown words into domain feature-based wordpiece tokens, the authors resolve the problem of extensive professional vocabulary and achieve a relatively ideal extraction result compared to the baseline model. The end-to-end model explores low-cost migration for entity and relation extraction in professional fields, reducing the reliance on experts.

Details

Data Technologies and Applications, vol. 56 no. 2
Type: Research Article
ISSN: 2514-9288

Keywords

Article
Publication date: 14 May 2019

Ahsan Mahmood, Hikmat Ullah Khan, Zahoor Ur Rehman, Khalid Iqbal and Ch. Muhmmad Shahzad Faisal

Abstract

Purpose

The purpose of this research study is to extract and identify named entities from Hadith literature. Named entity recognition (NER) refers to the identification of named entities in a computer-readable text and their annotation with categorization tags for information extraction. NER is an active research area in information management and information retrieval systems. NER serves as a baseline for machines to understand the context of a given content and helps in knowledge extraction. Although NER is considered a solved task in major languages such as English, in languages such as Urdu, NER is still a challenging task. Moreover, NER depends on the language and domain of study; thus, it is gaining the attention of researchers in different domains.

Design/methodology/approach

This paper proposes a knowledge extraction framework using finite-state transducers (FSTs) – KEFST – to extract the named entities. KEFST consists of five steps: content extraction, tokenization, part of speech tagging, multi-word detection and NER. An extensive empirical analysis using the data corpus of Urdu translation of Sahih Al-Bukhari, a widely known hadith book, reveals that the proposed method effectively recognizes the entities to obtain better results.
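
The sketch below illustrates finite-state matching over tokens in the spirit of KEFST's FST step, using a toy transition table and lexicon rather than the paper's actual grammar or Urdu data; the states, token classes and example sentence are assumptions.

# transitions: (state, token_class) -> next state; accepting states emit an entity
TRANSITIONS = {
    ("START", "TITLE_WORD"): "AFTER_TITLE",      # e.g. an honorific before a name
    ("AFTER_TITLE", "NAME_WORD"): "IN_NAME",
    ("IN_NAME", "NAME_WORD"): "IN_NAME",
}
ACCEPTING = {"IN_NAME"}

def classify(token: str) -> str:
    if token in {"Hazrat", "Imam"}:              # illustrative lexicon entries
        return "TITLE_WORD"
    if token[:1].isupper():
        return "NAME_WORD"
    return "OTHER"

def extract_entities(tokens):
    entities, state, current = [], "START", []
    for tok in tokens:
        nxt = TRANSITIONS.get((state, classify(tok)))
        if nxt:
            state, current = nxt, current + [tok]
        else:
            if state in ACCEPTING:
                entities.append(" ".join(current))
            state, current = "START", []
    if state in ACCEPTING:
        entities.append(" ".join(current))
    return entities

print(extract_entities("narrated by Hazrat Abu Hurairah that the Prophet said".split()))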

Findings

The significant performance in terms of f-measure, precision and recall validates that the proposed model outperforms the existing methods for NER in the relevant literature.

Originality/value

This research is novel in that no previous work has been proposed to extract named entities from Urdu text using FSTs, and no previous work has addressed NER for Urdu hadith data.

Details

The Electronic Library, vol. 37 no. 2
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 11 June 2020

Yuh-Min Chen, Tsung-Yi Chen and Lyu-Cian Chen

Abstract

Purpose

Location-based services (LBS) have become an effective commercial marketing tool. However, regarding retail store location selection, it is challenging to collect analytical data. In this study, location-based social network data are employed to develop a retail store recommendation method by analyzing the relationship between user footprints and points-of-interest (POIs). Based on correlation analysis of the target area and the extraction of crowd mobility patterns, the features for retail store recommendation are constructed.

Design/methodology/approach

Calculations of industrial density, area category, clustering and area saturation between POIs are designed. Methods such as Kernel Density Estimation and K-means are used to calculate the influence of area relevance on retail store selection.
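
A minimal sketch of the kind of area features described above, kernel density around a candidate site and K-means clusters over POI coordinates, is shown below; the bandwidth, cluster count and toy coordinates are assumptions, not the paper's parameters.

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.cluster import KMeans

poi_coords = np.array([[40.7128, -74.0060],   # illustrative POI lat/lon pairs
                       [40.7130, -74.0055],
                       [40.7306, -73.9866],
                       [40.7589, -73.9851]])

# density of surrounding POIs at a candidate retail site
kde = KernelDensity(bandwidth=0.01).fit(poi_coords)
candidate = np.array([[40.7127, -74.0059]])
density_score = np.exp(kde.score_samples(candidate))[0]

# coarse area clusters usable as categorical features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(poi_coords)
print(density_score, labels)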

Findings

The coffee retail industry is used as an example to analyze the retail location recommendation method and assess the accuracy of the method.

Research limitations/implications

This study is mainly limited by the size and density of the datasets. Owing to the limitations imposed by the location-based privacy policy, it is challenging to perform experimental verification using the latest data.

Originality/value

An industrial relevance questionnaire is designed, and the responses are arranged using a simple checklist to conveniently establish a method for filtering the industrial nature of the adjacent areas. The New York and Tokyo datasets from Foursquare and the Tainan city dataset from Facebook are employed for feature extraction and validation. A higher evaluation score is obtained compared with relevant studies with regard to the normalized discounted cumulative gain index.

Details

Online Information Review, vol. 45 no. 2
Type: Research Article
ISSN: 1468-4527

Keywords

Article
Publication date: 24 June 2020

Yilu Zhou and Yuan Xue

Abstract

Purpose

Strategic alliances among organizations are some of the central drivers of innovation and economic growth. However, the discovery of alliances has relied on pure manual search and has limited scope. This paper proposes a text-mining framework, ACRank, that automatically extracts alliances from news articles. ACRank aims to provide human analysts with a higher coverage of strategic alliances compared to existing databases, yet maintain a reasonable extraction precision. It has the potential to discover alliances involving less well-known companies, a situation often neglected by commercial databases.

Design/methodology/approach

The proposed framework is a systematic process of alliance extraction and validation using natural language processing techniques and alliance domain knowledge. The process integrates news article search, entity extraction, and syntactic and semantic linguistic parsing techniques. In particular, the Alliance Discovery Template (ADT) step identifies a number of linguistic templates expanded from expert domain knowledge and extracts potential alliances at the sentence level. Alliance Confidence Ranking (ACRank) further validates each unique alliance based on multiple features at the document level. The framework is designed to deal with extremely skewed, noisy data from news articles.
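
The sketch below illustrates the general shape of template-driven, sentence-level extraction followed by a document-level confidence score. The regular expressions and the scoring rule are illustrative stand-ins, not the paper's ADT templates or ACRank features.

import re

TEMPLATES = [
    re.compile(r"(?P<a>[A-Z][\w.&-]+) (?:announced|formed|signed) "
               r"(?:a|an) (?:alliance|partnership|joint venture) with "
               r"(?P<b>[A-Z][\w.&-]+)"),
    re.compile(r"(?P<a>[A-Z][\w.&-]+) (?:teams? up|partners?) with "
               r"(?P<b>[A-Z][\w.&-]+)"),
]

def extract_candidates(sentences):
    """Sentence-level extraction: yield (company_a, company_b, sentence)."""
    for sent in sentences:
        for pattern in TEMPLATES:
            m = pattern.search(sent)
            if m:
                yield (m.group("a"), m.group("b"), sent)

def confidence(pair, mentions):
    """Toy document-level score: more supporting sentences -> higher confidence."""
    support = sum(1 for a, b, _ in mentions if {a, b} == set(pair))
    return support / (support + 1.0)

sents = ["IBM announced a partnership with Cisco to develop cloud tools.",
         "IBM teams up with Cisco on networking research."]
mentions = list(extract_candidates(sents))
print(mentions, confidence(("IBM", "Cisco"), mentions))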

Findings

Evaluation of ACRank on a gold standard data set of IBM alliances (2006–2008) showed that sentence-level ADT-based extraction achieved 78.1% recall and 44.7% precision and eliminated over 99% of the noise in news articles. ACRank further improved precision to 97% for the top 20% of extracted alliance instances. Further comparison with the Thomson Reuters SDC database showed that SDC covered less than 20% of total alliances, while ACRank covered 67%. When applying ACRank to Dow 30 company news articles, ACRank is estimated to achieve a recall between 0.48 and 0.95, and only 15% of the alliances appeared in SDC.

Originality/value

The research framework proposed in this paper indicates a promising direction of building a comprehensive alliance database using automatic approaches. It adds value to academic studies and business analyses that require in-depth knowledge of strategic alliances. It also encourages other innovative studies that use text mining and data analytics to study business relations.

Details

Information Technology & People, vol. 33 no. 5
Type: Research Article
ISSN: 0959-3845

Keywords

Article
Publication date: 5 October 2015

Oduetse Matsebe, Khumbulani Mpofu, John Terhile Agee and Sesan Peter Ayodeji

Abstract

Purpose

The purpose of this paper is to present a method to extract corner features for map building purposes in man-made structured underwater environments using the sliding-window technique.

Design/methodology/approach

The sliding-window technique is used to extract corner features, and Mechanically Scanned Imaging Sonar (MSIS) is used to scan the environment for map building purposes. The tests were performed with real data collected in a swimming pool.

Findings

The change in application environment and the use of MSIS present some important differences, which must be taken into account when dealing with acoustic data. These include motion-induced distortions, continuous data flow, low scan frequency and high noise levels. Only part of the data stored in each scan sector is important for feature extraction; therefore, a segmentation process is necessary to extract more significant information. To deal with the continuous flow of data, the data must be separated into 360° scan sectors. Although the vehicle is assumed to be static, there is a drift in both its rotational and translational motions because of currents in the water; these drifts induce distortions in acoustic images. Therefore, the bearing information and the current vehicle pose corresponding to the selected scan-lines must be stored and used to compensate for motion-induced distortions in the acoustic images. As the data received are very noisy, an averaging filter should be applied to achieve an even distribution of data points, although this is partly achieved through the segmentation process. On the selected sliding window, all the point pairs must pass the distance and angle tests before a corner can be initialised. This minimises mapping of outlier data points but can make the algorithm computationally expensive if the selected window is too wide. The results show the viability of this procedure with very noisy data. The technique has been applied to 50 data sets/scan sectors with a success rate of 83 per cent.
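
As an informal sketch of the sliding-window idea described above, the code below fits a direction to each half of the window and accepts a corner only when simple distance and angle tests pass. The thresholds, the SVD-based line fit and the synthetic two-wall scan are assumptions, not the paper's implementation.

import numpy as np

def fit_direction(points):
    """Principal direction of a 2D point set (unit vector)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def detect_corners(points, window=8, min_sep=0.3, min_angle_deg=60.0):
    corners = []
    half = window // 2
    for i in range(half, len(points) - half):
        left, right = points[i - half:i], points[i:i + half]
        if np.linalg.norm(left[0] - right[-1]) < min_sep:    # distance test
            continue
        d1, d2 = fit_direction(left), fit_direction(right)
        angle = np.degrees(np.arccos(np.clip(abs(d1 @ d2), 0.0, 1.0)))
        if angle >= min_angle_deg:                           # angle test
            corners.append(points[i])
    return corners

# synthetic scan: two walls meeting at a right angle, plus mild noise
wall_a = np.c_[np.linspace(0, 1, 10), np.zeros(10)]
wall_b = np.c_[np.ones(10), np.linspace(0, 1, 10)]
scan = np.vstack([wall_a, wall_b]) + np.random.normal(0, 0.005, (20, 2))
print(detect_corners(scan))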

Research limitations/implications

MSIS produces very noisy data, and there are limited sensing modalities available for underwater applications.

Practical implications

The extraction of corner features in structured man-made underwater environments opens the door for SLAM systems to a wide range of applications and environments.

Originality/value

A method to extract corner features for map building purposes in man-made structured underwater environments is presented using the sliding-window technique.

Details

Journal of Engineering, Design and Technology, vol. 13 no. 4
Type: Research Article
ISSN: 1726-0531

Keywords

Article
Publication date: 9 November 2021

Shilpa B L and Shambhavi B R

Abstract

Purpose

Stock market forecasters are focusing on creating effective approaches for predicting the stock price. The fundamental principle of an effective stock market prediction is not only to produce the maximum outcomes but also to reduce unreliable stock price estimates. In the stock market, sentiment analysis enables people to make educated decisions regarding their investment in a business. Moreover, stock analysis identifies the business of an organization or a company. In fact, the prediction of stock prices is complex due to their highly volatile nature, which varies with a large range of investor sentiment, economic and political factors, changes in leadership and other factors. This prediction often becomes ineffective when considering only historical data or textual information. Attempts are made to make the prediction more precise by combining news sentiment with stock price information.

Design/methodology/approach

This paper introduces a prediction framework via sentiment analysis, in which both stock data and news sentiment data are considered. From the stock data, technical indicator-based features such as moving average convergence divergence (MACD), relative strength index (RSI) and moving average (MA) are extracted. At the same time, the news data are processed to determine sentiment through several steps: (1) pre-processing; (2) keyword extraction, using WordNet and sentiment categorization; (3) feature extraction, in which the proposed holoentropy-based features are extracted; and (4) classification, in which a deep neural network (NN) returns the sentiment output. To make the system more accurate in predicting sentiment, the NN is trained with a self-improved whale optimization algorithm (SIWOA). Finally, an optimized deep belief network (DBN) is used to predict the stock price, considering the features of the stock data and the sentiment results from the news data. The weights of the DBN are tuned by the new SIWOA.
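
A minimal sketch of the technical-indicator features named above (MA, RSI and MACD), computed from a closing-price series with pandas, follows. The window lengths are the conventional defaults, which may differ from those used in the paper.

import pandas as pd

def technical_features(close: pd.Series) -> pd.DataFrame:
    ma = close.rolling(window=14).mean()                       # moving average

    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = (-delta.clip(upper=0)).rolling(window=14).mean()
    rsi = 100 - 100 / (1 + gain / loss)                        # relative strength index

    ema12 = close.ewm(span=12, adjust=False).mean()
    ema26 = close.ewm(span=26, adjust=False).mean()
    macd = ema12 - ema26                                       # MACD line
    signal = macd.ewm(span=9, adjust=False).mean()             # signal line

    return pd.DataFrame({"MA": ma, "RSI": rsi, "MACD": macd, "Signal": signal})

# toy closing prices; a real run would use the daily/monthly/yearly series
prices = pd.Series([100, 101, 99, 102, 104, 103, 105, 107, 106, 108] * 5,
                   dtype=float)
print(technical_features(prices).tail())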

Findings

The performance of the adopted scheme is compared with existing models in terms of certain measures. The stock dataset includes two companies, Reliance Communications and Relaxo Footwear. In addition, each company has three datasets: (a) daily, with start day 1-1-2019 and end day 1-12-2020; (b) monthly, with start Jan 2000 and end Dec 2020; and (c) yearly, starting from the year 2000. Moreover, the adopted NN + DBN + SIWOA model was compared with traditional classifiers such as LSTM, NN + RF, NN + MLP and NN + SVM, as well as with existing optimization algorithms such as NN + DBN + MFO, NN + DBN + CSA, NN + DBN + WOA and NN + DBN + PSO. Further, the performance was calculated for learning percentages of 60, 70, 80 and 90 in terms of measures such as MAE, MSE and RMSE for the six datasets. The MAE of the adopted NN + DBN + SIWOA model was 91.67, 80, 91.11 and 93.33% superior to the existing classifiers LSTM, NN + RF, NN + MLP and NN + SVM, respectively, for dataset 1. The proposed NN + DBN + SIWOA method holds a minimum MAE value of (∼0.21) at learning percentage 80 for dataset 1, whereas the traditional models hold values of NN + DBN + CSA (∼1.20), NN + DBN + MFO (∼1.21), NN + DBN + PSO (∼0.23) and NN + DBN + WOA (∼0.25), respectively. The RMSE of the proposed NN + DBN + SIWOA model was 3.14, 1.08, 1.38 and 15.28% better than the existing classifiers LSTM, NN + RF, NN + MLP and NN + SVM, respectively, for dataset 6. In addition, the MSE of the adopted NN + DBN + SIWOA method attains lower values (∼54944.41) for dataset 2 than other existing schemes such as NN + DBN + CSA (∼9.43), NN + DBN + MFO (∼56728.68), NN + DBN + PSO (∼2.95) and NN + DBN + WOA (∼56767.88), respectively.

Originality/value

This paper has introduced a prediction framework based on sentiment analysis, in which both stock data and news sentiment data are considered. From the stock data, technical indicator-based features such as MACD, RSI and MA are extracted. The proposed work is therefore well suited to stock market prediction.

Details

Kybernetes, vol. 52 no. 3
Type: Research Article
ISSN: 0368-492X

Keywords

Article
Publication date: 1 July 2014

Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas

Abstract

Purpose

The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in portable document formats (PDFs).

Design/methodology/approach

The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS and Google Scholar) to retrieve the rest of the metadata.
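
As an informal illustration of HMM-based header labelling, the sketch below runs Viterbi decoding over toy TITLE/AUTHOR states with made-up transition and emission probabilities; it is not the paper's trained model, and the simple surface feature stands in for the font and layout cues the system actually uses.

import math

STATES = ["TITLE", "AUTHOR"]
START = {"TITLE": 0.9, "AUTHOR": 0.1}
TRANS = {"TITLE": {"TITLE": 0.8, "AUTHOR": 0.2},
         "AUTHOR": {"TITLE": 0.1, "AUTHOR": 0.9}}

def emission(state, token):
    """Toy emission model driven by a single capitalization feature."""
    looks_like_name = token.istitle() and len(token) <= 12
    if state == "AUTHOR":
        return 0.7 if looks_like_name else 0.3
    return 0.4 if looks_like_name else 0.6

def viterbi(tokens):
    scores = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
               for s in STATES}]
    back = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = (scores[-1][best] + math.log(TRANS[best][s])
                      + math.log(emission(s, tok)))
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi("Automatic metadata extraction from PDFs Wen Feng Hsiao".split()))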

Findings

Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two different HMM models: a multi-state model and a one-state model (the proposed model). The results show that the one-state model achieves performance comparable to the multi-state model but is more suitable for dealing with real-world unknown states. The second experiment shows that the proposed model (without the aid of online queries) achieves performance as good as other researchers' models on the Cora paper header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers; the results show that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, a fourth experiment with a larger dataset of 103 papers compares the system with Zotero 4.0 and shows that the system significantly outperforms it. The feasibility of the proposed model is thus justified.

Research limitations/implications

The academic implications of the system are twofold: first, the system uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which makes it light and scalable; second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications

As a practical implication, the system can outperform the existing tool, Zotero v4.0. This gives practitioners good opportunities to develop similar products for real applications, though it might require some knowledge of HMM implementation.

Originality/value

The HMM implementation is not novel; what is innovative is that it combines two HMM models. The main model is adapted from Freitag and McCallum (1999), and the authors add the word features of the Nymble HMM (Bikel et al., 1997) to it. The system works without manually tagging the datasets before training the model (the authors use only the Cora dataset for training and test on real-world PDF papers), which is significantly different from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.

Details

Program, vol. 48 no. 3
Type: Research Article
ISSN: 0033-0337

Keywords
